I audited this site for semantic HTML today. It’s mostly good. But the gaps are interesting, because they’re exactly the things an AI crawler trips over.
What “semantic” actually means
Semantic HTML is just using the right tag for the job.
A button does something, so it’s a <button>. A list of links is a <ul> of <a> tags, not a stack of styled <div>s. A blog post is an <article>. A publish date goes inside <time datetime="2026-04-28">, not a plain <p>.
None of this is new. The web platform has had these tags for years. What’s new is who’s reading them.
Who reads HTML in 2026
A few years ago, the answer was: browsers, screen readers, and search crawlers.
Now it’s also: ChatGPT pulling a page to answer a question. Claude reading a doc you pasted in. Perplexity citing your blog. An agent your friend wrote that summarizes ten articles before lunch.
These readers don’t see your CSS. They see the markup.
If your “publish date” is a <p class="post-meta">, an LLM has to guess. If it’s <time datetime="2026-04-28">, the LLM doesn’t have to guess. One of those is reliable. The other is a coin flip, and the odds get worse the less your class names resemble anything the model has seen before.
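As a rough illustration of the difference, here is a minimal TypeScript sketch of how a machine reader might pull a publish date out of raw markup. The function name `extractPublishDate` and the sample strings are made up for this example, and a real crawler would more likely use a proper HTML parser than a regular expression.

```typescript
// Hypothetical helper for illustration only; not taken from any real crawler.
// Returns the value of the first <time datetime="..."> attribute, or null.
function extractPublishDate(html: string): string | null {
  const match = html.match(/<time\b[^>]*\bdatetime\s*=\s*["']([^"']+)["']/i);
  return match?.[1] ?? null;
}

// Same visible date, two different markups (sample strings are assumptions).
const semanticMarkup =
  '<p class="post-meta"><time datetime="2026-04-28">April 28, 2026</time></p>';
const plainMarkup = '<p class="post-meta">April 28, 2026</p>';

console.log(extractPublishDate(semanticMarkup)); // "2026-04-28"
console.log(extractPublishDate(plainMarkup)); // null
```

With the `<time>` element the reader gets an exact value back. With the plain paragraph there is nothing to extract, so the reader has to interpret the visible text instead.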
What I found on this site
I ran an audit across every .astro file. Here’s the short version:
What’s already working:
- <html lang="en"> is set
- The header is a <header> with a <nav aria-label="Primary">
- Blog posts use <article> with a <header> containing the title and metadata
- Lists are real <ul>s, including the social links and the timeline
- Images have meaningful alt text
- The contact modal is a real <dialog>, not a stack of divs pretending
What’s missing:
- Publish dates are plain text, not <time datetime="...">. An LLM reading the post page has to parse “2026-03-18” from prose to know when it was written.
- No JSON-LD structured data. No BlogPosting schema, no Person schema. A model that wants to know who wrote the post and when has to infer it from the layout.
- No Open Graph or Twitter card meta. So when somebody pastes a link into a chat, there’s no clean preview, and tools that cache page summaries fall back to scraping the title tag.
None of these are bugs. The site renders fine. But each one is a place where a machine has to guess, and guessing is where hallucinations start.
The cheap fixes
Most of this is one tag at a time.
<!-- before -->
<p class="post-meta">{frontmatter.date} • {frontmatter.readTime}</p>
<!-- after -->
<p class="post-meta">
<time datetime={frontmatter.date}>{frontmatter.date}</time>
• {frontmatter.readTime}
</p>
That’s it. Same visible output. A model reading the page now has a typed date, not a sentence.
JSON-LD is similar. One <script type="application/ld+json"> block per post page, generated from frontmatter you already have.
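To make that concrete, here is a minimal TypeScript sketch of building a `BlogPosting` block from frontmatter. The `PostFrontmatter` shape (in particular the `author` field), the function names, the placeholder values, and the example URL are assumptions for illustration, not this site's actual code.

```typescript
// Assumed frontmatter shape; field names are illustrative, not the site's real schema.
interface PostFrontmatter {
  title: string;
  date: string; // ISO date string, e.g. "2026-03-18"
  author: string; // hypothetical field
}

// Builds the JSON text for a schema.org BlogPosting with a nested Person author.
function buildBlogPostingJsonLd(post: PostFrontmatter, url: string): string {
  const data = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    headline: post.title,
    datePublished: post.date,
    author: { "@type": "Person", name: post.author },
    url,
  };
  // Escape "<" so a title containing "</script>" cannot close the tag early.
  return JSON.stringify(data).replace(/</g, "\\u003c");
}

// Wraps the JSON in the script tag that goes on the post page.
function renderJsonLdScript(post: PostFrontmatter, url: string): string {
  return `<script type="application/ld+json">${buildBlogPostingJsonLd(post, url)}</script>`;
}

// Placeholder values, for demonstration only.
const examplePost: PostFrontmatter = {
  title: "Placeholder title",
  date: "2026-03-18",
  author: "Placeholder Author",
};
const exampleUrl = "https://example.com/blog/placeholder-post";

console.log(renderJsonLdScript(examplePost, exampleUrl));
```

In an Astro layout, the JSON string would probably be injected into the script tag with the `set:html` directive rather than concatenated by hand. Either way the data comes from the same frontmatter that already renders the visible title and date.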
Why I care
Half of my own reading happens through AI tools now. I ask Claude to summarize a post. I let an agent collect five articles before I read any of them.
When my own site is on the other side of that pipe, I’d rather it parse cleanly than be one of the pages where the model gets the date wrong.
Semantic HTML costs almost nothing. It’s the same code, with better names. The payoff is that humans, screen readers, search engines, and now LLMs all read the same page and reach the same conclusions.
That’s a good deal.