I recently built log-mcp, almost entirely with Claude Code. It’s an MCP server that gives Claude tools for analyzing large log files without loading them into context. By the end it had a Rust TF-IDF classifier processing 1.3M lines/sec, an optional BERT-mini transformer on Metal GPU, and 7 analysis tools — search, compare, error grouping, classification.
"What's wrong with this Spark log?"
│
▼
┌──────────────────────────────────────┐
│ Log file: 705K lines, 67 MB │
│ (would obliterate a context window) │
└──────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Rust TF-IDF classifier │
│ 1.3M lines/sec │
│ ─────────────────────────────────── │
│ 70-95% discarded as routine │
│ also catches lines without ERROR │
│ (grep ERROR: 2, classifier: 92) │
└──────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ BERT-mini (optional) │
│ Metal GPU, ~2K lines/sec │
│ ─────────────────────────────────── │
│ re-scores LOOK lines, │
│ demotes false positives │
└──────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Python MCP tools │
│ search, compare, group errors │
└──────────────┬───────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Claude │
│ ─────────────────────────────────── │
│ "34 errors in 5 groups. Top: │
│ shuffle fetch failures (18x), │
│ connection refused to slave-13." │
└──────────────────────────────────────┘
67 MB log → ~3 KB in context. Nothing else touches the window.
This is the story of how it came together.
The starting question
The project started with a simple idea: what if Claude could analyze log files the way an engineer does — not by reading every line, but by scanning for what’s interesting?
Claude Code scaffolded a basic MCP server with Python. The first version had a handful of tools: log_overview, search_logs, get_log_segment, analyze_errors, log_stats, and compare_logs. It could parse timestamps, extract log levels, group errors by fingerprint, and normalize variable parts (UUIDs, IPs, numbers) so that messages differing only in request IDs would group together.
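The normalization trick is simple enough to sketch. This is a hypothetical reconstruction, not log-mcp's actual code — the pattern list, replacement tokens, and ordering are my assumptions:

```python
import re

# Illustrative error-fingerprinting sketch: strip the variable parts of a
# message so lines differing only in IDs collapse into one group.
# Patterns and tokens are assumptions, not the project's exact code.
_PATTERNS = [
    # UUIDs first, so their digits aren't eaten by the number pattern
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I),
     "<UUID>"),
    # IPs before bare numbers, for the same reason
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def fingerprint(message: str) -> str:
    for pattern, token in _PATTERNS:
        message = pattern.sub(token, message)
    return message
```

With this, `fingerprint("timeout for 10.0.0.1 after 500 ms")` and the same message for any other host and duration produce identical strings, so frequency counting over fingerprints groups them. The ordering matters: UUIDs and IPs must be replaced before the generic number pattern.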
Then the question that would drive every iteration for the next five days:
Is this better than analyzing these files without the MCP?

Claude’s answer was honest: “For these two small files? Not dramatically.” It could eyeball 80-line test logs just fine.
That honesty set the tone for the whole project. Every improvement had to survive the same question.
The output format pivot
I generated larger test logs — 1500 lines per file — and ran the tools again. The analysis was solid: 125 errors collapsed into 7 unique fingerprints, frequency outliers were detected across 3000 lines. But Claude’s assessment was still lukewarm:
The tools do useful work, but the output format undermines the value.
The JSON blobs were enormous. Pattern and sample fields were redundant. File paths were repeated everywhere. The tools were doing good analysis and then presenting it in the worst possible way for an LLM to consume.
Then it clicked: the consumer of these tools is an AI, not a human. JSON formatting — all those quotes, braces, commas, colons — wastes tokens. An LLM doesn’t need structured data to understand a summary.
All 6 tools got rewritten to return plain text. Error analysis went from a nested JSON blob to:
Summary: 34 errors in 5 groups.
--- 18x ---
Fingerprint: shuffle.RetryingBlockFetcher: Exception ...
First: L29764 2017-02-01T15:55:17
Last: L30677 2017-02-01T15:55:51
After this change, Claude’s answer to the recurring question finally flipped: “Yes, I would use it now. Reading 2800 raw lines would eat ~30% of my context window.”
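Emitting that shape is cheap. A sketch of a renderer in this spirit — the tuple layout and function name are mine, not the project's:

```python
def render_error_groups(groups):
    """groups: list of (count, fingerprint, first_seen, last_seen) tuples.
    Emits the compact plain-text shape instead of nested JSON.
    Hypothetical sketch, not log-mcp's actual renderer."""
    total = sum(count for count, _, _, _ in groups)
    out = [f"Summary: {total} errors in {len(groups)} groups."]
    # largest groups first, so the most frequent failure leads
    for count, fp, first, last in sorted(groups, reverse=True):
        out.append(f"--- {count}x ---")
        out.append(f"Fingerprint: {fp}")
        out.append(f"First: {first}")
        out.append(f"Last: {last}")
    return "\n".join(out)

print(render_error_groups([
    (18, "shuffle.RetryingBlockFetcher: Exception ...",
     "L29764 2017-02-01T15:55:17", "L30677 2017-02-01T15:55:51"),
]))
```

No quotes, no braces, no repeated keys: every token that reaches the model carries signal.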
Real logs break everything
The first real-world test was GitHub Actions CI logs from a project of mine. The tools returned zero patterns. Complete failure.
The problem: the parser only knew 5 hardcoded timestamp formats. GitHub Actions uses tab-delimited columns with an ISO timestamp buried after a job/step prefix — build test 2026-02-24T19:01:13.5332698Z Starting tests. None of the regexes matched.
Claude designed an adaptive parser that samples lines from the file, finds timestamps anywhere in each line, and detects if the text before the timestamp shares a common structure across lines (tab-delimited columns, fixed character offsets). After implementation, all 13 CI runs parsed with 100% timestamp extraction.
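The core of the adaptive idea fits in a few lines. A simplified sketch — the real parser handles many more timestamp shapes and fixed-offset prefixes; this version only checks tab-delimited prefixes:

```python
import re

# Search for an ISO timestamp anywhere in each sampled line instead of
# anchoring at column 0, then check whether the prefix before it has a
# stable shape. Hypothetical sketch, not the project's parser.
ISO = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")

def detect_layout(sample_lines):
    tab_counts = []
    for line in sample_lines:
        m = ISO.search(line)
        if m:
            # how many tab-delimited fields precede the timestamp?
            tab_counts.append(line[:m.start()].count("\t"))
    if not tab_counts:
        return None
    # a stable prefix (same field count on every line) means the
    # timestamp position is predictable and extraction is reliable
    return tab_counts[0] if len(set(tab_counts)) == 1 else None
```

On the GitHub Actions shape above, `detect_layout` finds the timestamp after a consistent two-field job/step prefix, which a column-0-anchored regex never would.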
Then came the real validation: one of those 13 “successful” CI runs had a Permission denied error buried at line 574 that the workflow had silently swallowed. analyze_errors would have surfaced it immediately. That’s exactly the kind of thing these tools are for — finding needles in haystacks that humans gloss over.
Content-based error detection
The CI logs exposed another gap. analyze_errors found zero errors because CI logs don’t use standard log levels. Lines like fatal: detected dubious ownership in repository and error: failed to write to file have the signal in the message content, not in a level field.
The fix was a two-pass approach: first try level-based detection, then fall back to content heuristics — regex patterns matching fatal:, Permission denied, ##[error], panic, etc. This made the tools useful on any log format, not just ones with proper log levels.
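In sketch form — the pattern list here is illustrative, not the project's exact heuristic set:

```python
import re

# Two-pass error detection sketch. Pass 1 trusts an explicit level field;
# pass 2 falls back to content heuristics for logs without proper levels.
LEVEL = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")
CONTENT = re.compile(
    r"fatal:|error:|panic|Permission denied|##\[error\]", re.IGNORECASE
)

def find_errors(lines):
    hits = [line for line in lines if LEVEL.search(line)]
    if hits:
        return hits
    # no level fields matched: fall back to message-content heuristics
    return [line for line in lines if CONTENT.search(line)]
```

The fallback only fires when level-based detection comes up empty, so well-formed logs still get the precise path.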
Stress testing on real data
Next I pointed the tools at real log datasets. A 10MB Zookeeper log (74K lines) compressed down to 20 patterns in one call. A 67MB Spark executor log (705K lines) — which would obliterate any context window — distilled into 5 error groups with full stack traces in seconds.
But manual investigation of the Zookeeper log revealed a blind spot. The tools caught symptoms (Connection broken, 10,356 occurrences) but missed the cause: the log told the story of ~20 leader elections over 27 days, with Exception when following the leader and rapid LOOKING/FOLLOWING/LEADING state transitions. The tools see each line independently without tracking state transitions across time.
I asked: “How did the 10MB file not pollute your context?” The answer was the key architectural insight: MCP tools act as a compression layer. The file was processed in a separate process, returning only ~3KB of summary. Manual Grep/Read would have dumped raw results directly into context.

The ML pivot
At this point I had a question: could we go further? Instead of relying on log levels and regex heuristics, could a model learn what’s semantically interesting in a log file?
The idea: use Claude to label log lines as LOOK (interesting) or SKIP (routine noise), then train a small classifier as a pre-filter. Not summarization, not explanation — just a filter. “Here are the 30 lines out of 700K that you should actually look at.”
I downloaded 16 datasets from Loghub (32K lines total), windowed them into 50-line chunks, and used Claude Haiku via the Batch API to label each line LOOK or SKIP. Cost: about $0.50.
Training a TF-IDF + logistic regression classifier on the labels gave a LOOK F1 of 0.79. Good enough for a pre-filter.
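The whole first stage is a few lines of sklearn. A minimal sketch with toy data — the hyperparameters and feature setup are assumptions, not the project's tuned configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 32K Claude-labeled Loghub lines.
lines = [
    "session opened for user root", "synchronized to 10.0.0.1",
    "unable to qualify my domain name", "illegal attempt to update",
]
labels = ["SKIP", "SKIP", "LOOK", "LOOK"]

# char_wb n-grams are robust to the odd tokens in log lines;
# the (3, 5) range and default regularization are illustrative.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
clf.fit(lines, labels)
print(clf.predict(["unable to qualify hostname"]))
```

The real model was trained on the labeled Loghub corpus and evaluated on held-out data; this sketch only shows the shape of the pipeline.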
The Rust classifier
The Python classifier ran at ~90K lines/sec with multiprocessing. For a 450M-line benchmark across all 16 Loghub datasets, that’s too slow.
“Let’s add matching using the model in Rust.”
What followed was the most technically dense part of the project. Claude built a complete Rust crate with PyO3 bindings — TF-IDF vectorization, logistic regression inference, the full normalization pipeline reimplemented in Rust.
Getting exact parity with sklearn was the hardest part. The key bug: sklearn’s char_wb analyzer uses text.split() (whitespace split), not word-boundary regex. The IP address 10.0.0.1 was treated as one token by sklearn but split at word boundaries by Rust’s regex. After fixing that, parity was perfect — difference at machine epsilon (2.22e-16).
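The difference is easy to reproduce directly against sklearn:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# sklearn's char_wb analyzer splits on whitespace, so "10.0.0.1" stays
# one token and char n-grams can span the dots.
analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()
sklearn_grams = analyzer("10.0.0.1")

# a word-boundary (\w+) split loses the dots entirely
words = re.findall(r"\w+", "10.0.0.1")

print(".0." in sklearn_grams)  # True: the n-gram spans a dot
print(words)                   # ['10', '0', '0', '1']
```

A Rust port using word-boundary tokenization never emits the dot-spanning n-grams, so its feature vectors silently diverge from sklearn's on any line containing an IP, a version string, or a path.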
Initial speed: ~25K lines/sec. After adding Rayon for parallelism: 137K lines/sec.
“Can we squeeze to a million lines per second?”
This kicked off a deep optimization cycle. Claude profiled each step and optimized:
- Normalization (16.9 → 1.9 μs/line): replaced fancy_regex with the standard regex crate for 7 of 8 patterns, wrote a custom byte-level matcher for the one pattern that needed a lookahead
- Char TF-IDF (21.0 → 2.9 μs/line): byte-slice hashmap lookups instead of String allocation, a custom open-addressing hash table (HashLookup) with collision-free 64-bit hashing, flat Vec<u16> count arrays with dirty tracking, thread-local buffers
- Word TF-IDF (3.7 → 0.8 μs/line): manual ASCII word scanner replacing regex
- I/O pipelining: background reader thread fills 128MB chunks while the classifier processes in parallel
Some optimizations failed. An Aho-Corasick automaton for 80K char n-gram patterns created a ~50MB automaton with terrible cache behavior. A bloom filter pre-check added complexity for marginal benefit. These got discarded.
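The I/O pipelining idea translates to any language. A minimal Python sketch (the real implementation is Rust with 128MB chunks; this toy version ignores lines split across chunk boundaries):

```python
import io
import queue
import threading

# A background reader keeps the next chunk in flight while the main
# thread works on the current one. Chunk size is tiny for illustration.
def read_chunks(f, chunk_queue, chunk_size=64):
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        chunk_queue.put(chunk)
    chunk_queue.put(None)  # sentinel: end of file

def count_lines_pipelined(f):
    chunk_queue = queue.Queue(maxsize=4)  # bounded, so the reader can't run away
    threading.Thread(target=read_chunks, args=(f, chunk_queue), daemon=True).start()
    total = 0
    while (chunk := chunk_queue.get()) is not None:
        # stand-in for classification work; it overlaps with the next read
        total += chunk.count("\n")
    return total

print(count_lines_pipelined(io.StringIO("a\nb\nc\n")))  # 3
```

A production version must re-stitch the partial line at each chunk boundary before classifying; the overlap between reading and compute is the point here.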
Final result: 1.3M lines/sec median, 1.5M peak. 450M lines in 325 seconds on an M3 Ultra. Error capture: 99.95% — only 18,339 lines missed out of 35.9 million ERROR/FATAL/CRITICAL lines, and those were mostly BGL’s repetitive instruction cache parity error corrected that the model correctly learned to skip as routine hardware telemetry.
The BERT stage
The TF-IDF classifier is fast but blunt. I wanted a second stage for precision. “Train a BERT-mini on the same data.” The resulting model (11.2M parameters) pushed LOOK F1 from 0.44 to 0.89.
Running BERT inference in Rust required building a transformer from scratch using the candle crate with Metal acceleration. The most entertaining bug: narrow() after CLS token extraction produced non-contiguous tensors that crashed Metal’s matmul kernel. Fixed with .contiguous().
Dynamic padding (pad to longest sequence in batch, not max 256) pushed throughput from 800 to 3,700 lines/sec on Metal GPU.
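The padding change itself is trivial; the win is that short log lines stop dragging dead pad tokens through every transformer layer. A sketch (pad id 0 and the 256 cap are assumptions):

```python
# Pad each batch to its own longest sequence rather than a fixed 256.
def pad_batch(token_ids, pad_id=0, max_len=256):
    longest = min(max(len(seq) for seq in token_ids), max_len)
    return [
        seq[:longest] + [pad_id] * (longest - len(seq))
        for seq in token_ids
    ]

batch = [[5, 6], [7, 8, 9], [10]]
print(pad_batch(batch))  # [[5, 6, 0], [7, 8, 9], [10, 0, 0]]
```

For a batch of typical 10-30-token log lines, that is roughly a 10x reduction in attention work versus padding everything to 256.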
The two-stage pipeline: TF-IDF runs at 1.3M lines/sec with a relaxed threshold, then BERT re-scores only the LOOK candidates at ~2K lines/sec.
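The wiring between the stages is simple; the thresholds below are illustrative, not the tuned values:

```python
# Two-stage filter sketch: a cheap, high-recall first stage feeds a
# slower, high-precision second stage. Scorer signatures are assumptions.
def two_stage_filter(lines, tfidf_score, bert_score, relaxed=0.3, strict=0.5):
    # Stage 1: TF-IDF at a relaxed threshold, tuned for recall (fast path).
    candidates = [line for line in lines if tfidf_score(line) >= relaxed]
    # Stage 2: BERT re-scores only the survivors, demoting false positives.
    return [line for line in candidates if bert_score(line) >= strict]

# toy scorers standing in for the real models
tfidf = {"err A": 0.9, "warn B": 0.4, "ok C": 0.1}.get
bert = {"err A": 0.95, "warn B": 0.2, "ok C": 0.1}.get

print(two_stage_filter(["err A", "warn B", "ok C"], tfidf, bert))  # ['err A']
```

Here "warn B" squeaks past the relaxed TF-IDF threshold but is demoted by the BERT re-score — exactly the false-positive case the second stage exists for. Because BERT only sees the small candidate set, the pipeline's throughput stays close to the TF-IDF stage's.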
What the classifier actually sees
Thunderbird HPC log — 2,000 lines
┌─────────────────────────────────────────────────────────┐
│ sshd: session opened for user root ░░░ SKIP │
│ ntpd: synchronized to 10.0.0.1 ░░░ SKIP │
│ kernel: AHCI port 0 reset complete ░░░ SKIP │
│ crond: (root) CMD (/usr/lib64/sa/sa1) ░░░ SKIP │
│ sshd: session opened for user root ░░░ SKIP │
│ sendmail: unable to qualify my domain name ███ LOOK │ ← no ERROR level
│ sshd: session closed for user root ░░░ SKIP │
│ ntpd: synchronized to 10.0.0.2 ░░░ SKIP │
│ dhcpd: unknown lease 10.100.4.251 ███ LOOK │ ← no ERROR level
│ crond: (root) CMD (/usr/lib64/sa/sa1) ░░░ SKIP │
│ rrdtool: illegal attempt to update ███ LOOK │ ← no ERROR level
│ ... │
│ (1,908 lines) ░░░ SKIP │
└─────────────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
92 LOOK lines (4.6%) 1,908 SKIP lines (95.4%)
→ sent to Claude → never touch context
vs. grep ERROR on the same file: 2 lines
The best way to understand the value is to compare against grep ERROR on the same file. On a Thunderbird HPC log (2,000 lines):
search_logs level=ERROR — 2 lines:
error: postdrop: warning: unable to look up public/pickup: No such file
error: slurmd: error: Unable to create FIFO
classify_lines — 92 lines, including:
sendmail: unable to qualify my own domain name (tbird-sm1)
dhcpd: DHCPDISCOVER from 00:09:3d:12:00:e2 via eth2: unknown lease
rrdtool: illegal attempt to update using time 1131710721 when last update is 1131710721
pbs_mom: Bad file descriptor in tm_reply
kernel: Times: total = 42, boot = -4131
None of these have ERROR level. A DNS misconfiguration, a DHCP lease error, monitoring data corruption, a bad file descriptor, a negative boot time — all things an engineer would want to investigate, all invisible to level-based search.
On the flip side, here’s what gets correctly skipped:
sshd(pam_unix): session opened for user root by (uid=0)
ntpd: synchronized to 10.0.0.1, stratum 3
kernel: hm-ahci: AHCI 0001:01:00.0 port 0 reset complete
crond: (root) CMD (/usr/lib64/sa/sa1 1 1)
Routine session opens, time syncs, health checks, cron jobs. These lines are the 95% of log noise that buries the signal.
The Zookeeper stress test made the same point at scale: 74K lines, the classifier surfaced Cannot open channel warnings and epoch resets, while skipping thousands of routine NIOServerCnxn: Closed socket connection entries. But it also revealed the classifier’s limit — rapid cycling between LOOKING/FOLLOWING/LEADING states (a quorum instability pattern) required understanding sequences of lines across time, not individual lines in isolation.
The meta moment
At some point: “Write your own assessment of the tools for the README.” Claude produced a genuinely honest “Claude’s take” section — where the tools help, where they’re a wash, what they can’t do. “Now document known gaps in CLAUDE.md” (the file that future Claude instances read for project context).
The result: an AI tool that documents itself, files bugs against itself, and reviews itself — because the AI is both the builder and the primary consumer. The human touches two endpoints: “what’s wrong with this log?” in, plain English answer out. Everything in between is AI talking to itself.
What I learned
Claude Code is genuinely productive for ambitious projects. Over 18 Claude Code sessions, Claude wrote the vast majority of the code — Python MCP server, Rust classifier with custom hash tables and SIMD-friendly data structures, BERT training pipeline, the labeling infrastructure. My role was directing, evaluating, and pushing — “can we get to a million lines per second?”
Honest evaluation drives iteration. The recurring “is it better than reading the files directly?” question killed comfortable mediocrity. Each honest “not really” answer — and Claude gave several — pointed directly at what to fix next: output format, parser flexibility, content heuristics, ML pre-filtering.
The compression layer insight matters. The most important architectural idea isn’t any individual tool — it’s that MCP tools are a compression layer between raw data and LLM context. A 67MB log file becomes 3KB of actionable summary without ever touching the context window. This changes what’s feasible.
Format matters as much as analysis. The JSON-to-plain-text rewrite was arguably the single most impactful change. The analysis was always good; the presentation was wasting half the tokens.
The project is open source on GitHub. To try it, open a Claude Code session and paste:
Install https://github.com/ascii766164696D/log-mcp as an MCP server
and build the Rust classifier too