SHIPPABLE MVP

Astroturf shows evidence where analysis exists, and creates an ingestion path where it does not.

The platform surfaces real coordinated-comment analysis for federal rulemaking, labels partial baselines honestly, and routes unsupported sectors into docket registration rather than fake product pages.

Featured Investigation

FCC docket 17-108 / landmark finding

Democratic voice is being hijacked by automated paraphrasing.

Public commenting periods are saturated by lobby campaigns using bots that subtly rewrite the same template a thousand different ways. Keyword filters miss every paraphrase. Dense vector clustering on Databricks collapses them back into one piece of actionable evidence.

Naive exact hashing

Failed to recognize paraphrases. Only surfaced literal, character-for-character copies.

Semantic clustering

1,002

Caught the full coordinated template, even when sponsors swapped synonyms and altered prefaces.

Detection lift

63x

1,002 Largest campaign size

3 Coordinated campaigns

1,017 Comments in campaigns

4,993 Total comments analyzed

Detected clusters

Showing all 3 clusters

1,002 comments

“We need the FCC to defend the rights of millions of Internet users by upholding net neutrality protections. I stand with the millions of other Internet users who've urged the Commission to keep important net neutrality protections intact...”

Aug 28, 2017

13 comments

“Net neutrality has created an unreliable landscape for consumers and businesses alike. We need Congress to bring clarity to this debate.”

Aug 28, 2017

2 comments

“I urge FCC Chairman Ajit Pai to preserve real Net Neutrality under the FCC's existing rules and keep broadband internet access classified u...”

Aug 28, 2017

MVP COVERAGE

Explore analyzed coverage

Primary browsing is limited to surfaces with evidence: FCC semantic clustering and EPA exact-hash baseline results.

Live Databricks validated

Telecom & Net Neutrality

Semantic clustering and Vector Search validation for the FCC 17-108 Net Neutrality proceeding.

Agencies: FCC
Clusters: 3

Baseline only; semantic clustering queued

Climate / Oil & Gas / Methane

Exact-hash duplicate baseline for EPA methane comments, with semantic clustering still pending.

Agencies: EPA
Clusters: 7

View analyzed coverage

INGESTION ENTRY POINT

Analyze a docket

Unsupported topics become configured ingestion runs, not empty dashboards. Generate a docket registry snippet and the command sequence for regulations.gov or FCC ECFS.

Generate pipeline config

Supported source paths

These agencies can be registered through the pipeline when a reviewer has a real docket ID and scale estimate.

CFPB FTC SEC

OVERSIGHT MATRIX

Agencies with evidence

FCC has live semantic validation. EPA has baseline-only evidence. Other agencies are reachable through Analyze a docket until a run produces results.

FCC: Live Databricks validated

Federal Communications Commission

Dockets: 1
Comments: 4,993

EPA: Baseline only

Environmental Protection Agency

Dockets: 1

BENCHMARK PROOF

Exact-hash baselines vs. semantic models

The MVP keeps partial analysis visible without overstating it, and uses Databricks where local pairwise clustering hits a hard scaling wall.

Naive Hashing vs. Dense Semantic Connected Components

Naive duplicate checks fail on customized campaigns. On FCC Proceeding **17-108**, the exact-duplicate baseline covered 318 filings while semantic clustering covered 1,017, which suggests that most coordinated campaign volume is paraphrased rather than copied verbatim.

Exact duplicate baseline: 318 filings

Surfaced **16** groups of literal copies. Missed all comments containing typos, custom prefaces, or synonym rephrasings.

Covered: 6.4%
Uncovered: 93.6%
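To illustrate why an exact-hash baseline misses paraphrases, here is a minimal sketch. The sample comments and the `exact_hash_groups` helper are made up for illustration and are not taken from the Astroturf codebase; the only assumption is that the baseline hashes the full comment text.

```python
import hashlib
from collections import defaultdict


def exact_hash_groups(comments):
    """Group comment indices whose text is character-for-character identical."""
    groups = defaultdict(list)
    for idx, text in enumerate(comments):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(idx)
    # Only digests shared by two or more comments count as duplicate groups.
    return [ids for ids in groups.values() if len(ids) > 1]


# Illustrative comments: two literal copies and one paraphrase.
comments = [
    "I urge the FCC to keep net neutrality protections intact.",
    "I urge the FCC to keep net neutrality protections intact.",
    "I ask the FCC to keep net neutrality protections in place.",
]

duplicate_groups = exact_hash_groups(comments)
```

Only the two literal copies share a digest; the paraphrase differs by a few words and therefore hashes to an unrelated value.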

Astroturf semantic clustering: 1,017 filings

Surfaced **3** coordinated campaigns. Consolidated near-duplicates and paraphrased templates under a single medoid (representative comment) per campaign.

Covered: 20.4%
Uncovered: 79.6%
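The coverage percentages can be reproduced from the counts reported on this page (318 and 1,017 filings out of 4,993 comments analyzed). The `coverage_pct` helper is only an illustrative name.

```python
TOTAL_COMMENTS = 4993  # total comments analyzed, as reported above


def coverage_pct(flagged, total=TOTAL_COMMENTS):
    """Share of all analyzed comments that a method placed in a group, in percent."""
    return round(100 * flagged / total, 1)


exact_covered = coverage_pct(318)      # exact-duplicate baseline
semantic_covered = coverage_pct(1017)  # semantic clustering

exact_uncovered = round(100 - exact_covered, 1)
semantic_uncovered = round(100 - semantic_covered, 1)
```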

The infrastructure

Why Databricks is load-bearing.

Each of the six agents leans on a specific Databricks capability. Pull any of these out and the pipeline either stops scaling, stops being reproducible, or stops being safe to put in front of a regulator.

Delta Lake + Unity Catalog

Interrupted ingestion / re-run drift

Every one of the six agents writes through Delta MERGE on a stable primary key. Because each write is a keyed MERGE inside an ACID transaction, 87-minute ECFS slices, retries after rate limiting, and partial-failure replays all converge to the same bronze/silver/gold table state. Unity Catalog adds column-level RBAC for PII isolation.
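The idempotency property can be sketched without Spark. The example below is a plain-Python stand-in for a keyed upsert, not Delta's MERGE itself: the "table" is a dict, and the `comment_id` key and the row shapes are hypothetical.

```python
def merge_upsert(table, batch, key="comment_id"):
    """Upsert rows into a dict-backed table: update on key match, insert otherwise."""
    for row in batch:
        existing = table.get(row[key], {})
        table[row[key]] = {**existing, **row}
    return table


bronze = {}
slice_a = [
    {"comment_id": "c1", "body": "first comment"},
    {"comment_id": "c2", "body": "second comment"},
]

merge_upsert(bronze, slice_a)
after_first_run = dict(bronze)

# Replaying the same slice (for example after a partial failure) changes nothing.
merge_upsert(bronze, slice_a)
after_replay = dict(bronze)
```

A blind append would have produced four rows after the replay; keying the write on a stable identifier is what lets a re-run converge to the same state.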

Foundation Model API

PIPELINE FLOW

The Multi-Agent Medallion Sequence

Astroturf processes data through six independent, idempotent agents. Delta Lake tables serve as the durable state machine connecting agents, while MLflow tracks run provenance.

IngestionAgent

Multi-Source Puller

Fetches comments via ECFS/Regulations.gov APIs. Upserts raw comments into bronze Delta tables, keyed on unique identifiers.

ParserAgent

Metadata Extractor

Separates inline bodies, enriches comment details, catalogs scanned attachment binaries, and flags boilerplate covers.

EmbeddingAgent

Vectorizer Node

Distributes batch text blocks across Spark nodes. Encodes comments into 1024-dim dense vectors using BGE-large.

ClusteringAgent

Vector Search Solver

Queries distributed Vector Search (HNSW) nearest-neighbor indexes and groups comments whose cosine similarity exceeds a fixed threshold (0.92).
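A minimal sketch of threshold-based grouping follows, assuming clusters are the connected components of the graph whose edges are pairs at or above the cosine threshold. It uses brute-force pairwise comparison on toy 2-dimensional vectors in place of Vector Search and 1024-dimension embeddings; the function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cluster_by_threshold(vectors, threshold=0.92):
    """Connected components over pairs whose cosine similarity meets the threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy embeddings: the first two point in nearly the same direction.
toy_vectors = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
clusters = cluster_by_threshold(toy_vectors)
```

The pairwise loop is quadratic in the number of comments, which is the cost a nearest-neighbor index avoids at scale.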

RIGOR & TRANSPARENCY

Methodological Bounds & Limitations

An honest representation of AI data pipelines requires clear documentation of scientific limits. Here are the primary analytical constraints of the current Astroturf iteration:

1. Temporal Horizon Slicing

Our active local case study evaluates public comments submitted within a narrow 3-day window (August 28 to August 30, 2017). Coordinated campaigns run longer than that, so this slice captures the major filing burst but undercounts total campaign volume.

Bound: Data Scope Slice

2. Cosine Threshold Sensitivity

The clustering agent uses a cosine similarity threshold of **`0.92`** over BGE embeddings. Commenters who add long personal paragraphs or heavily customize the preface can fall below this threshold and go undetected (false negatives). Coordination is a spectrum, and a single cutoff cannot capture all of it.

Bound: Semantic Cutoff Bound

3. Astroturf vs. Allowed Advocacy

The system groups highly similar template text, but semantic grouping alone cannot distinguish permitted civic bulk advocacy (e.g. authorized petitions compiled by advocacy groups) from malicious identity hijacking without checking external lobby registries and authorization audits.

Bound: Intent Attribution Bound

The system

Six agents over a medallion lakehouse.

Each transition between tiers is an idempotent agent with its own contract. Databricks features are called out at every touch point.

The pipeline

How it works

Ingest
Public comments are pulled from two federal sources: regulations.gov v4 (CFPB, EPA, FTC, FDA, ...) and the FCC ECFS public API (telecom dockets). Ingestion uses cursor-based pagination, exponential-backoff retries, and a shared api.data.gov rate-limit budget. Data lands in a Delta Lake bronze table on Databricks Unity Catalog with full provenance, idempotent re-runs, and an MLflow run per ingestion.
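The ingest loop described here, cursor-based pagination with exponential-backoff retries, can be sketched as follows. Everything named in the sketch is hypothetical: `fetch_page` stands for whatever client call returns a page and its next cursor, `RateLimited` stands for a rate-limit response, and the in-memory `PAGES` source replaces the real APIs so the example runs offline.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal that the API rejected a request for rate-limit reasons."""


def fetch_all(fetch_page, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Follow cursors until exhausted, backing off exponentially on RateLimited."""
    records, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page, next_cursor = fetch_page(cursor)
                break
            except RateLimited:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            raise RuntimeError("retry budget exhausted")
        records.extend(page)
        if next_cursor is None:
            return records
        cursor = next_cursor


# Demo with an in-memory fake source that is rate limited once.
PAGES = {None: (["c1", "c2"], "cursor-2"), "cursor-2": (["c3"], None)}
failures = {"remaining": 1}


def fake_fetch(cursor):
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RateLimited()
    return PAGES[cursor]


delays = []  # record the requested sleeps instead of actually waiting
comments = fetch_all(fake_fetch, sleep=delays.append)
```

Injecting `sleep` keeps the retry policy testable; a shared rate-limit budget across sources would need additional state that this sketch leaves out.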
Parse and enrich
Comments are normalized through a medallion architecture (bronze to silver). HTML cleaning, attachment cataloging, and detail-level enrichment run as a source-aware Databricks Workflow: ECFS rows skip the detail-fetch round-trip because their bodies are already plain text, while regulations.gov rows fan out per-comment detail requests under the rate-limit budget.
Embed
Each comment is converted to a 1024-dimension semantic embedding via databricks-bge-large-en, served through the Databricks Foundation Model API. Embeddings are written to a Delta table and synced to a Databricks Vector Search index.
Cluster
A two-stage clusterer collapses the comparison space. A first stage over token shingles generates candidate pairs; a second stage confirms semantic neighbors by cosine similarity over the Vector Search index, above a tunable threshold (default 0.92). Cluster assignments and representative templates land in gold Delta tables, joined with cluster sizes and date spans for the UI.
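The two-stage idea can be sketched as below. The candidate generator is an assumption: the text does not say which shingle-based method the pipeline uses, so this sketch proposes any pair of comments that shares at least one 3-word shingle, via an inverted index. The texts, toy embeddings, and function names are illustrative only.

```python
import math
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles for a comment (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def candidate_pairs(texts, k=3):
    """Stage 1: pairs sharing at least one shingle, found through an inverted index."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for sh in shingles(text, k):
            index[sh].add(doc_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def confirmed_pairs(texts, embeddings, threshold=0.92):
    """Stage 2: keep only candidates whose embeddings meet the cosine threshold."""
    return sorted(
        (i, j) for i, j in candidate_pairs(texts)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    )


texts = [
    "please keep net neutrality rules in place",
    "i urge you to keep net neutrality rules intact",
    "repeal the rules and let congress decide",
]
toy_embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]  # stand-ins for real vectors

candidates = candidate_pairs(texts)
pairs = confirmed_pairs(texts, toy_embeddings)
```

Only pairs that survive the cheap lexical stage are scored with cosine similarity, so the expensive comparison runs on a small fraction of all possible pairs.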

Sample Size (N)	Required Float32 RAM	Single-Node Status
1,000 comments	4 MB	Local Safe
5,000 comments	100 MB	Local Safe (Capped)
10,000 comments	400 MB	Boundary / Slow
100,000 comments	40 GB	OOM Crash / Out of Memory
1,000,000+ comments	4 TB	Physically Impossible Locally
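The RAM figures in the table are consistent with storing a dense N x N similarity matrix at 4 bytes per float32 value, using decimal units. The sketch below reproduces them under that assumption; the helper names are illustrative.

```python
def pairwise_matrix_bytes(n, bytes_per_value=4):
    """Bytes needed for a dense n x n float32 similarity matrix."""
    return n * n * bytes_per_value


def human(nbytes):
    """Format a byte count with decimal (power-of-1000) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if nbytes < 1000 or unit == "TB":
            return f"{nbytes:g} {unit}"
        nbytes /= 1000


sizes = {n: human(pairwise_matrix_bytes(n)) for n in (1_000, 5_000, 10_000, 100_000, 1_000_000)}
```

Memory grows with the square of the comment count, so a 10x increase in comments costs 100x the RAM.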

The O(N^2) Memory Wall: Why Local Clustering Fails

How the Coordinated Campaign Hides Itself

Eleanor Vance

Gregory House

Robert Chase

Vector Search

Workflows / Jobs

MLflow audit trails

Databricks SQL Connector

AttributionAgent

MigrationAgent

Attribute and trace

Serve