Astroturf shows evidence where analysis exists, and creates an ingestion path where it does not.
The platform surfaces real coordinated-comment analysis for federal rulemaking, labels partial baselines honestly, and routes unsupported sectors into docket registration rather than fake product pages.
Democratic voice is being hijacked by automated paraphrasing.
Public commenting periods are saturated by lobby campaigns using bots that subtly rewrite the same template a thousand different ways. Keyword filters miss every paraphrase. Dense vector clustering on Databricks collapses them back into one piece of actionable evidence.
Naive exact hashing
16
Failed to recognise paraphrases. Only surfaced literal, character-for-character copies.
Semantic clustering
1,002
Caught the full coordinated template, even when sponsors mutated synonyms and prefaces.
Detection lift
63x
Explore analyzed coverage
Primary browsing is limited to surfaces with evidence: FCC semantic clustering and EPA exact-hash baseline results.
Analyze a docket
Unsupported topics become configured ingestion runs, not empty dashboards. Generate a docket registry snippet and the command sequence for regulations.gov or FCC ECFS.
Generate pipeline config
Exact-hash baselines vs. semantic models
The MVP keeps partial analysis visible without overstating it, and uses Databricks where local pairwise clustering hits a hard scaling wall.
Naive Hashing vs. Dense Semantic Connected Components
Naive duplicate checks fail on customized campaigns. The comparative evaluation on FCC Proceeding **17-108** demonstrates that paraphrasing represents the vast majority of coordinated lobby campaign volume.
Naive hashing: surfaced **16** rigid literal copy groups. Missed all comments containing typos, custom prefaces, or synonym rephrasings.
Dense semantic connected components: surfaced **3** massive cohesive campaigns. Consolidated near-duplicates and paraphrased templates around a single medoid per campaign.
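To illustrate why exact hashing only catches literal copies, here is a minimal sketch. The sample comments and the `exact_hash_groups` helper are hypothetical and are not taken from the Astroturf codebase; SHA-256 stands in for whatever hash the baseline actually uses.

```python
import hashlib
from collections import defaultdict

def exact_hash_groups(comments):
    """Group comments whose text is byte-for-byte identical."""
    groups = defaultdict(list)
    for idx, text in enumerate(comments):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(idx)
    # Keep only groups with at least two members (actual duplicates).
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical comments: two literal copies plus two light paraphrases.
comments = [
    "Please keep the existing rules in place.",
    "Please keep the existing rules in place.",
    "Please keep the current rules in place.",
    "As a citizen, I ask you to please keep the existing rules in place!",
]

duplicate_groups = exact_hash_groups(comments)
print(duplicate_groups)  # [[0, 1]]
```

Only the two identical strings share a digest. A one-word synonym swap or an added preface produces a different hash, so those comments never join the group.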
The infrastructure
Why Databricks is load-bearing.
Each of the six agents leans on a specific Databricks capability. Pull any of these out and the pipeline either stops scaling, stops being reproducible, or stops being safe to put in front of a regulator.
Delta Lake + Unity Catalog
Interrupted ingestion / re-run drift
Every one of the six agents writes through Delta MERGE on a stable primary key. ACID transactions guarantee that 87-minute ECFS slices, mid-rate-limit retries, and partial-failure replays all converge to the same idempotent bronze/silver/gold tables. Unity Catalog adds column-level RBAC for PII isolation.
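The convergence claim can be modelled in a few lines. This is an in-memory sketch of upsert-on-primary-key semantics, not the Delta Lake API: the `merge_upsert` helper, the `comment_id` key, and the sample rows are all assumptions made for illustration. In Delta Lake itself the equivalent would be a `MERGE INTO ... WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *` statement.

```python
def merge_upsert(table, batch, key="comment_id"):
    """Upsert rows into `table` (a dict keyed by primary key), mimicking MERGE semantics."""
    for row in batch:
        # Matched key: the row is replaced. Unmatched key: the row is inserted.
        table[row[key]] = dict(row)
    return table

bronze = {}
batch = [
    {"comment_id": "a1", "body": "first"},
    {"comment_id": "b2", "body": "second"},
]

merge_upsert(bronze, batch)
snapshot = {k: dict(v) for k, v in bronze.items()}

# Replay a partial slice, then the full batch again, as after an interrupted run.
merge_upsert(bronze, batch[:1])
merge_upsert(bronze, batch)

assert bronze == snapshot  # replays leave the table unchanged
print(len(bronze))  # 2
```

Because every write is keyed, retries and replays overwrite rows with themselves instead of appending duplicates.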
Foundation Model API
The Multi-Agent Medallion Sequence
Astroturf processes data through six independent, idempotent agents. Delta Lake tables serve as the durable state machine connecting agents, while MLflow tracks run provenance.
IngestionAgent
Multi-Source Puller
Fetches comments via the ECFS and Regulations.gov APIs. Upserts raw comments into bronze Delta tables on unique keys.
ParserAgent
Metadata Extractor
Separates inline bodies, enriches comment details, catalogs scanned attachment binaries, and flags boilerplate covers.
EmbeddingAgent
Vectorizer Node
Distributes batch text blocks across Spark nodes. Encodes comments into 1024-dim dense vectors using BGE-large.
ClusteringAgent
Vector Search Solver
Queries distributed Vector Search (HNSW) nearest-neighbor indexes. Groups comments whose cosine similarity exceeds a fixed threshold (0.92).
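The grouping step in the ClusteringAgent can be pictured as connected components over a similarity graph. The sketch below is a local, brute-force illustration on toy 3-dimensional vectors and is not the Databricks Vector Search API. The function names are hypothetical, and the all-pairs loop is precisely the part a nearest-neighbor index would replace at scale.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cluster_by_threshold(vectors, threshold=0.92):
    """Connected components over the graph of pairs with cosine >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

def medoid(member_ids, vectors):
    """Member with the highest total similarity to its own cluster."""
    return max(member_ids,
               key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in member_ids))

# Toy vectors: three near-parallel "template" comments and one unrelated comment.
vectors = [
    [1.0, 0.0, 0.0],
    [0.98, 0.2, 0.0],
    [0.95, 0.3, 0.0],
    [0.0, 0.0, 1.0],
]
clusters = cluster_by_threshold(vectors)
representative = medoid(clusters[0], vectors)
print(clusters, representative)  # [[0, 1, 2], [3]] 1
```

The medoid is the member closest to everything else in its cluster, which makes it a natural choice for the representative template shown to a reviewer.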
Methodological Bounds & Limitations
An honest representation of AI data pipelines requires clear documentation of scientific limits. Here are the primary analytical constraints of the current Astroturf iteration:
1. Temporal Horizon Slicing
Our active local case study evaluates public comments submitted within a narrow 3-day window (August 28 to August 30, 2017). Coordinated campaign waves run longer than that, so this slice captures the major filing burst but undercounts total campaign volume.
2. Cosine Threshold Sensitivity
The clustering agent uses a fixed cosine similarity threshold of **`0.92`** over BGE embeddings. Comments in which citizens add extensive personal paragraphs or heavily customize the preface can fall below this threshold and go undetected (false negatives). Coordination is a spectrum, and a single cutoff captures only its near-verbatim end.
3. Astroturf vs. Allowed Advocacy
The system groups highly similar template text, but semantic grouping alone cannot distinguish permitted civic bulk advocacy (e.g. authorized petitions compiled by advocacy groups) from malicious identity hijacking without checking external lobby registries and authorization audits.
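To make the threshold sensitivity in point 2 concrete, consider a deliberately simplified geometric model. It assumes a customized comment's embedding is the unit template vector plus `alpha` times a unit vector orthogonal to it. Real BGE embeddings are not additive in this way, so the numbers show only the shape of the effect.

```python
import math

THRESHOLD = 0.92

def similarity_after_customization(alpha):
    """Cosine between unit vector t and t + alpha * p, where p is a unit vector orthogonal to t."""
    return 1.0 / math.sqrt(1.0 + alpha * alpha)

# Largest alpha that still clears the threshold: solve 1 / sqrt(1 + a^2) = THRESHOLD.
max_alpha = math.sqrt(1.0 / THRESHOLD**2 - 1.0)

for alpha in (0.1, 0.3, 0.5, 1.0):
    sim = similarity_after_customization(alpha)
    print(f"alpha={alpha:.1f} cosine={sim:.3f} clustered={sim >= THRESHOLD}")
```

Under this toy model the cutoff falls at `alpha` of roughly 0.43. Once the personal content carries about that weight relative to the template, the comment drops out of the cluster even though it is still built on the same template.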
The system
Six agents over a medallion lakehouse.
Each transition between tiers is an idempotent agent with its own contract. Databricks features are called out at every touch point.
The pipeline
How it works
Ingest
Public comments are pulled from two federal sources: regulations.gov v4 (CFPB, EPA, FTC, FDA, ...) and the FCC ECFS public API (telecom dockets). Ingestion uses cursor-based pagination, exponential-backoff retries, and a shared api.data.gov rate-limit budget. Data lands in a Delta Lake bronze table on Databricks Unity Catalog with full provenance, idempotent re-runs, and an MLflow run per ingestion.
Parse and enrich
Comments are normalized through a medallion architecture (bronze to silver). HTML cleaning, attachment cataloging, and detail-level enrichment run as a Databricks Workflow, source-aware: ECFS rows skip the detail-fetch round-trip because their bodies are already plain text, while regulations.gov rows fan out per-comment detail requests under the rate-limit budget.
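The cursor pagination and exponential-backoff retries described under Ingest could be sketched as follows. The `fetch_page` callable, the `RateLimited` exception, and the fake two-page source are hypothetical stand-ins. Nothing here reflects the actual regulations.gov or ECFS response formats.

```python
import time

class RateLimited(Exception):
    """Raised by the (hypothetical) page fetcher when the API signals a rate limit."""

def fetch_all(fetch_page, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Follow cursors until exhausted, retrying rate-limited pages with exponential backoff."""
    cursor, rows = None, []
    while True:
        for attempt in range(max_retries):
            try:
                page, next_cursor = fetch_page(cursor)
                break
            except RateLimited:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            raise RuntimeError(f"gave up on cursor {cursor!r} after {max_retries} retries")
        rows.extend(page)
        if next_cursor is None:
            return rows
        cursor = next_cursor

# Fake two-page source that rate-limits the very first call (demonstration only).
calls = {"n": 0}

def fake_fetch(cursor):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited()
    if cursor is None:
        return ["c1", "c2"], "page-2"
    return ["c3"], None

delays = []
comments = fetch_all(fake_fetch, sleep=delays.append)
print(comments, delays)  # ['c1', 'c2', 'c3'] [1.0]
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. A shared rate-limit budget across sources would sit one level above this loop.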
Embed
Each comment is converted to a 1024-dimension semantic embedding via databricks-bge-large-en, served through the Databricks Foundation Model API. Embeddings are written to a Delta table and synced to a Databricks Vector Search index.
Cluster
A two-stage clusterer collapses the comparison space: a first pass on token shingles generates candidate pairs, then cosine similarity over the Vector Search index confirms semantic neighbors above a tunable threshold (default 0.92). Cluster assignments and representative templates land in gold Delta tables, joined with cluster sizes and date spans for the UI.
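A rough sketch of the two-stage idea follows. The text above does not name the candidate-generation method, so the inverted index over 3-token shingles is an assumption chosen for simplicity. The `similarity` callable and the hard-coded cosine value stand in for a lookup against the Vector Search index.

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """Set of k-token shingles from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def candidate_pairs(texts, k=3, min_shared=2):
    """Stage one: pairs of comments sharing at least `min_shared` shingles."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for sh in shingles(text, k):
            index[sh].add(doc_id)
    shared = defaultdict(int)
    for doc_ids in index.values():
        for pair in combinations(sorted(doc_ids), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)

def confirm(pairs, similarity, threshold=0.92):
    """Stage two: keep candidates whose embedding similarity clears the threshold."""
    return [(i, j) for i, j in pairs if similarity(i, j) >= threshold]

# Hypothetical comments: two share a template, the third only shares a phrase.
texts = [
    "please keep the current rules in place for all users",
    "i ask that you please keep the current rules in place",
    "repeal the current rules because they hurt investment",
]
pairs = candidate_pairs(texts)

# Hypothetical precomputed cosine for the surviving candidate pair.
cosines = {(0, 1): 0.95}
confirmed = confirm(pairs, lambda i, j: cosines[(i, j)])
print(pairs, confirmed)  # [(0, 1)] [(0, 1)]
```

Only pairs that survive the cheap first stage reach the embedding comparison, which is what keeps the confirmation step from growing with the square of the comment count.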