Consumer · Updated 2026-04-20 · 10 min read

The Wayback Machine Is Going Dark in 2026

241 news outlets now block the Internet Archive's crawlers. Reddit cut it off in August 2025. The New York Times added archive.org_bot to robots.txt at the end of 2025. Cloudflare blocks AI crawlers by default as of July 2025. Google Cache is gone. Bing Cache is gone. The open record of the web is narrowing fast, and that matters for journalists, OSINT operators, and bug bounty researchers who need archival evidence. Practical alternatives and a local-capture pipeline included.

Phillip (Tre) Bucchi · Founder, Valtik Studios. Penetration tester.

Founder of Valtik Studios. Penetration tester. Based in Connecticut, serving US mid-market.

# The Wayback Machine Is Going Dark in 2026, and Almost Nobody Is Talking About It

You click a link from a 2021 tweet. The target page is gone. You paste the URL into the Wayback Machine. Blocked by robots.txt. You try archive.today. The page loads. A 2026 copy, not the 2021 version you needed. The record is gone.

This is not a bug. It is the plan.

## The shift, 2023 to 2026

The Internet Archive has been the web's long-term memory since 1996. For most of that time, its deal with publishers was implicit and boring. A crawler visits, saves a snapshot, moves on. Search engines did the same thing. Nobody cared.

That deal broke in 2023. AI training was the trigger. Publishers saw OpenAI, Anthropic, Google, and Perplexity scraping their content to train models that then competed with them for readers. They reached for robots.txt. First they blocked GPTBot, Google-Extended, ClaudeBot, CCBot. Then they noticed that the Wayback Machine served those same pages back to anyone who asked, including bot operators who rotated user agents. So they blocked the Wayback Machine too.

As of early 2026, 241 news outlets across nine countries block at least one Internet Archive crawler in their robots.txt files. Most of them target archive.org_bot and ia_archiver-web.archive.org. Gannett (USA Today's parent, the largest US newspaper publisher) accounts for roughly 87% of the blocked sites. That single policy decision removed hundreds of local US papers from the archival record in one move. [1]
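
For reference, the block itself is just a few lines of robots.txt. The snippet below is an illustrative sketch using the crawler names mentioned above, not any specific outlet's file; real rules vary per publisher.

```
# Illustrative robots.txt entries; actual files differ by publisher
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /
```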

The New York Times added archive.org_bot to its robots.txt at the end of 2025, with a spokesperson saying the Times is blocking because the Wayback Machine "provides unfettered access to Times content, including by AI companies, without authorization." [2] Reddit followed in August 2025. The Wayback Machine can now only crawl Reddit's homepage. Comments, post detail pages, subreddit pages, and user profiles are off the archival record. [3]

Upstream, the pipes themselves changed. On July 1, 2025, Cloudflare became the first major infrastructure provider to block AI crawlers by default for every new domain. That default does not explicitly target the Internet Archive, but Cloudflare's bot management fingerprints archival crawlers by behavior too, and any site on Cloudflare's default configuration now presents a harder surface to any scraper that does not present a verified user agent. [4]

Meanwhile the fallbacks died. Google retired the cache: operator in early 2024 and quietly removed the cache link from search results by September 2024. [5] Microsoft removed Bing's cache link in December 2024. [6] The options for seeing "what a page said last week" are narrowing fast.

## What the blockers say

Three stated reasons. None of them fully holds up to the way the Archive actually works.

AI scraping. This is the one they lead with. The argument is that AI companies bypass robots.txt on the original site by fetching the Wayback Machine's copy instead. This is partly true. It is also partly theater. Common Crawl's CCBot is the most-blocked crawler among top news sites (around 70 to 75% block rate), and a November 2025 investigation showed Common Crawl shipped millions of paywalled articles to OpenAI and others regardless of robots.txt. [7] Blocking the Archive does not stop AI training. It just removes the historical record.

Copyright. The Internet Archive lost Hachette v. Internet Archive in September 2024 when the Second Circuit ruled that its Controlled Digital Lending program was not fair use. That case was about lending scanned books, not the Wayback Machine, but publishers blurred the lines on purpose. Over 500,000 books came off the lending platform as a result. [8] The record labels (UMG, Sony, Capitol) sued over the Great 78 Project for $400M in 2023, then quietly settled in 2025 for undisclosed terms. [9] The pattern is clear. Archives are expensive to defend.

Bandwidth and abuse. This one has some merit. In October 2024 the Archive was hit by a DDoS from the pro-Palestinian hacktivist group BlackMeta and a separate breach that exposed 31 million user records (email addresses, screen names, bcrypt password hashes). [10] A second breach through its Zendesk instance followed on October 20. The Archive has been slower and spottier since. Publishers cite this as a reason to block, but most of the real bandwidth pressure comes from AI scrapers, not archive crawlers.

## The cost nobody prices in

Archives are infrastructure for accountability. That statement sounds abstract until you try to do any of the following:

  • Prove a phishing page said what it said when you reported it three months ago.
  • Show that a vendor's marketing page claimed a security certification before the breach and quietly dropped the claim after.
  • Document a CVE timeline against the actual page text at the time of disclosure.
  • Find the original wording of a product terms of service before the policy was revised.
  • Fact check a public official against a deleted campaign statement.
  • Cite a webpage in a law review article and expect the cite to resolve five years later.

When the Wayback Machine is blocked from a domain, all of that becomes dependent on whether you personally saved a copy at the right moment. The record becomes private, unevenly held, and easy to dispute.

Journalists lose the most visible slice. The EFF made this point in March 2026: blocking the Archive will not stop AI training, but it will erase the web's historical record. [11] Link rot on news sites is already brutal. Stories get rewritten after publication without a correction notice. Headlines change. The "stealth edit" is a known pattern. The Archive was the check on that pattern. Take it away and every editorial change happens in the dark.

Security research loses the less visible but arguably deeper slice. Bug bounty recon tools like waybackurls extract historical URLs for a domain from the Archive. That is how researchers find forgotten endpoints, deprecated API versions, leaked keys in old query strings, stale JavaScript files with sensitive comments. When the Archive has nothing for a target, this technique returns nothing. Your recon gets shallower. Vendors love that. It favors the defender of record and punishes external scrutiny.
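
If you have never looked under the hood of that workflow, it is essentially one HTTP call. Here is a minimal sketch of the technique against the public Wayback CDX API; the target domain is a placeholder, and this is the generic interface rather than any particular tool's code.

```python
# Minimal sketch of Wayback-based URL recon (what tools like waybackurls automate).
# Uses the public Wayback CDX API; example.com is a placeholder target.
import requests

def wayback_urls(domain: str) -> list[str]:
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": f"{domain}/*",   # every captured URL under the domain
            "fl": "original",       # return only the original URL field
            "collapse": "urlkey",   # deduplicate by normalized URL
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text.splitlines()

if __name__ == "__main__":
    for url in wayback_urls("example.com"):
        print(url)
```

When a domain blocks archive.org_bot, that index simply stops growing for it, which is exactly the shallower recon described above.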

Phishing investigators use the Archive to prove what a malicious page displayed at the time it was reported, because the attacker usually tears the page down within hours. No archive, no evidence. You are back to screenshots, which adversaries argue are forgeable.

There is a second-order effect that should make anyone in democracy-adjacent work nervous. If the public record of the web is held only by publishers who can edit it freely, the concept of "what the internet said yesterday" concentrates into the hands of the entities with the most incentive to revise it. That is a one-way trip. Once context is gone, you cannot go back and get it.

GDPR makes this worse, not better. The right to erasure under Article 17 pushes against archival preservation. The exceptions (journalism, public interest, academic research) exist but are uneven in practice, and the Archive still has to handle takedown requests one by one. [12] It is the right of powerful parties to quietly delete their history, dressed up as privacy.

## Practical alternatives for operators

If you do security, OSINT, journalism, or anything involving a chain of custody on web evidence, you cannot rely on the Wayback Machine alone anymore. Here is what actually works in 2026.

archive.today (archive.ph / archive.is / archive.md). Single-operator service, capture-on-demand, JavaScript rendering. Ignores robots.txt. The functional replacement for Google Cache. Trade-off: single point of failure, no bulk API, no time-travel for pages that were not captured before you needed them. Use it the moment you see something you might need later. [13]

Ghostarchive. Uses the Webrecorder suite, strong on JavaScript-heavy pages including X/Twitter and YouTube. Good complement to archive.today when you need dynamic content preserved. [14]

Perma.cc. Harvard's citation-grade archive, used by over 150 journals and courts. Requires an account through a participating institution. Each capture gets a permanent ID and a stable perma.cc link built for citation. This is what you want for any link you plan to cite in a formal document, a CVE write-up, or a disclosure report. [15]
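
If your institution gives you API access, records can also be created programmatically. A rough sketch against Perma's REST API follows; the endpoint shape is from Perma's documentation, and the key handling is a placeholder.

```python
# Sketch of creating a Perma.cc record via its REST API. Assumes an API key
# issued through a participating institution; the key value is a placeholder.
import requests

PERMA_API_KEY = "your-api-key"  # placeholder

def perma_capture(url: str) -> str:
    resp = requests.post(
        "https://api.perma.cc/v1/archives/",
        headers={"Authorization": f"ApiKey {PERMA_API_KEY}"},
        json={"url": url},
        timeout=120,
    )
    resp.raise_for_status()
    guid = resp.json()["guid"]          # permanent record ID
    return f"https://perma.cc/{guid}"   # citable permalink

if __name__ == "__main__":
    print(perma_capture("https://example.com/advisory"))
```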

Memento Time Travel. Aggregator across the Wayback Machine, archive.today, Perma, and others. If one archive has the snapshot, Memento finds it. Use it as your first-pass search when you do not know which archive has what you need.
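
That first-pass search can be scripted too. The sketch below hits the Time Travel JSON endpoint with a placeholder URL and date; treat the exact response layout as something to verify against the live API.

```python
# Sketch of a cross-archive lookup via the Memento Time Travel API.
# Endpoint shape: /api/json/<YYYYMMDDhhmmss>/<url>; response fields may vary.
import requests

def find_snapshot(url: str, timestamp: str = "20210601") -> dict:
    resp = requests.get(
        f"http://timetravel.mementoweb.org/api/json/{timestamp}/{url}",
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("mementos", {})

if __name__ == "__main__":
    mementos = find_snapshot("https://example.com/old-page")
    # "closest" is the capture nearest the requested timestamp,
    # from whichever participating archive holds it
    print(mementos.get("closest"))
```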

Common Crawl. Not a real-time archive, but a quarterly snapshot of the open web going back a decade. If you need the headers, body, or link structure of a page as it existed in a past crawl window, Common Crawl has it, and the WARC files are downloadable. This is the only option that scales to bulk analysis. [16]
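
Pulling a single record out of a crawl is a two-step lookup: query the index for the WARC location, then range-request that slice from the data bucket. A sketch, with the crawl ID as an example you would swap for a current one:

```python
# Sketch of fetching a page's archived record from a Common Crawl snapshot.
# CC-MAIN-2024-33 is an example crawl ID; pick a current one from commoncrawl.org.
import json
import requests

CRAWL = "CC-MAIN-2024-33"

def cc_lookup(url: str) -> list[dict]:
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url, "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    # One JSON object per line, each describing a WARC record location
    return [json.loads(line) for line in resp.text.splitlines()]

def cc_fetch_warc_record(record: dict) -> bytes:
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    resp = requests.get(
        f"https://data.commoncrawl.org/{record['filename']}",
        headers={"Range": f"bytes={start}-{end}"},  # fetch just this record
        timeout=120,
    )
    resp.raise_for_status()
    return resp.content  # gzipped WARC record: headers + body as crawled

if __name__ == "__main__":
    records = cc_lookup("example.com/")
    if records:
        print(len(cc_fetch_warc_record(records[0])), "bytes of WARC data")
```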

Your own capture pipeline. This is the one nobody wants to hear and everyone eventually needs. Local archives under your control, captured at the moment you decide the page matters, in a format that preserves the DOM, resources, and timestamps. The stack we run is headless Chromium driven by Puppeteer or Playwright, calling Chrome DevTools Protocol's Page.captureSnapshot method, which returns a single-file MHTML archive containing HTML, CSS, and referenced resources. [17] Tools like monolith produce similar single-file outputs. Combine with a wget --mirror of the domain if you need the full site, and WARC writers like pywb or wget --warc-file if you want replay-compatible archives.
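
A minimal version of that capture step, assuming Playwright's Python bindings and a placeholder target URL, looks like this:

```python
# Minimal local-capture sketch: headless Chromium via Playwright, single-file
# MHTML snapshot through CDP Page.captureSnapshot. Paths and URL are placeholders.
from datetime import datetime, timezone
from playwright.sync_api import sync_playwright

def capture_mhtml(url: str, out_dir: str = ".") -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        cdp = page.context.new_cdp_session(page)
        snapshot = cdp.send("Page.captureSnapshot", {"format": "mhtml"})
        browser.close()
    path = f"{out_dir}/capture-{stamp}.mhtml"
    with open(path, "w", encoding="utf-8") as f:
        f.write(snapshot["data"])  # single-file archive: HTML, CSS, resources
    return path

if __name__ == "__main__":
    print(capture_mhtml("https://example.com/target-page"))
```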

For bug bounty work specifically, capture targets once at the start of an engagement, then again before any disclosure or public write-up. Hash the capture. That file is your evidence if the vendor changes the page.
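
Hashing is the cheap part. A sketch, with a placeholder filename; record the digest next to the capture timestamp so a later copy can be compared against it.

```python
# SHA-256 of a capture file, streamed in chunks so large MHTML files are fine.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    print(sha256_of("capture-20260420T120000Z.mhtml"))  # placeholder filename
```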

## The longer arc

If the current trend holds, 2026 is the year the Wayback Machine becomes unreliable by default for any major publisher. 2027 is the year it happens to medium-sized sites as Cloudflare's managed robots.txt and similar CDN features get turned on widely. 2028 is when the failure modes become obvious to the public. Someone cites a study, the cited URL goes dead, no archive has it, and the study quietly loses its footnote.

The Archive itself is not going away. Brewster Kahle will keep it running as long as he can raise money. But its coverage of the open web is going to look less like a near-complete record and more like a Swiss cheese of domains that still allow crawling, plus whatever older material was already captured. Every blocked domain today means a shorter memory tomorrow.

The longer arc assumes nobody fights it. That is the default. Fighting it looks like funding the Archive, running local capture pipelines, pressuring CDNs to treat archival crawlers differently from AI training crawlers, and pushing legislative carve-outs for preservation. None of those are hot takes. All of them are work.

## What Valtik does about it

We maintain local archives of every research target we touch. A headless Chromium MHTML capture on first recon, again before any write-up, again at disclosure. The captures live on our VPS. If we end up referencing the target in a CVE advisory, a bug bounty report, or a blog post, the capture is what we rely on, not the Archive. The Archive is a nice-to-have now. It was the primary source. It is not anymore.

---

## Sources

  1. Nieman Journalism Lab, "News publishers limit Internet Archive access due to AI scraping concerns," January 2026. https://www.niemanlab.org/2026/01/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/
  2. Tom's Hardware, "News outlets are blocking Wayback Machine from archiving their pages," 2026. https://www.tomshardware.com/tech-industry/big-tech/news-outlets-are-blocking-wayback-machine-from-archiving-their-pages-23-outlets-concerned-ai-companies-might-abuse-fair-use-and-use-it-to-train-their-models
  3. 9to5Mac, "Reddit blocks non-profit Wayback Machine from archiving the site," August 12, 2025. https://9to5mac.com/2025/08/12/reddit-blocks-non-profit-wayback-machine-from-archiving-the-site/
  4. Cloudflare, "Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large," press release, July 1, 2025. https://www.cloudflare.com/press/press-releases/2025/cloudflare-just-changed-how-ai-crawlers-scrape-the-internet-at-large/
  5. Search Engine Land, "Google Search officially retires cache link," February 2, 2024. https://searchengineland.com/google-search-officially-retires-cache-link-437122
  6. Search Engine Land, "Bing officially removes cache link from search results," December 2024. https://searchengineland.com/bing-officially-removes-cache-link-from-search-results-449220
  7. BuzzStream, "Which News Sites Block AI Crawlers in 2025? [New Data]." https://www.buzzstream.com/blog/publishers-block-ai-study/
  8. Internet Archive Blog, "End of Hachette v. Internet Archive," December 4, 2024. https://blog.archive.org/2024/12/04/end-of-hachette-v-internet-archive/
  9. Music Ally, "Labels and Internet Archive hope to settle legal battle soon," April 10, 2025. https://musically.com/2025/04/10/labels-and-internet-archive-hope-to-settle-legal-battle-soon/
  10. BleepingComputer, "Internet Archive hacked, data breach impacts 31 million users," October 2024. https://www.bleepingcomputer.com/news/security/internet-archive-hacked-data-breach-impacts-31-million-users/
  11. EFF, "Blocking the Internet Archive Won't Stop AI, But It Will Erase the Web's Historical Record," March 2026. https://www.eff.org/deeplinks/2026/03/blocking-internet-archive-wont-stop-ai-it-will-erase-webs-historical-record
  12. GDPR Article 17, Right to Erasure. https://gdpr-info.eu/art-17-gdpr/
  13. Archive.today, Archive Team wiki. https://wiki.archiveteam.org/index.php/Archive.today
  14. Ghostarchive. https://ghostarchive.org/
  15. Perma.cc, Harvard Library Innovation Lab. https://perma.cc/about
  16. Common Crawl FAQ. https://commoncrawl.org/faq
  17. Chrome DevTools Protocol, Page.captureSnapshot (MHTML). Used via Puppeteer or Playwright for local evidence capture. https://chromedevtools.github.io/devtools-protocol/tot/Page/
  18. Internet Archive Blog, "Robots.txt meant for search engines don't work well for web archives," April 17, 2017. https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
  19. The Intercept, "New York Times Doesn't Want Its Website Archived," September 17, 2023. https://theintercept.com/2023/09/17/new-york-times-website-internet-archive/
  20. Techdirt, "News Publishers Are Now Blocking The Internet Archive, And We May All Regret It," February 13, 2026. https://www.techdirt.com/2026/02/13/news-publishers-are-now-blocking-the-internet-archive-and-we-may-all-regret-it/
Tags: wayback machine, internet archive, osint, web archiving, ai scraping, robots.txt, cloudflare, bug bounty, journalism, research
