Tags: server

Monday, April 7th, 2025

Denial

The Wikimedia Foundation, stewards of the finest projects on the web, have written about the hammering their servers are taking from the scraping bots that feed large language models.

Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

Drew DeVault puts it more bluntly, saying Please stop externalizing your costs directly into my face:

Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.

And no, a robots.txt file doesn’t help.

If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned.

Free and open source projects are particularly vulnerable. FOSS infrastructure is under attack by AI companies:

LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.

You try to do the right thing by making knowledge and tools freely available. This is how you get repaid. AI bots are destroying Open Access:

There’s a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.

My own experience with The Session bears this out.

Ars Technica has a piece on this: Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries.

So does MIT Technology Review: AI crawler wars threaten to make the web more closed for everyone.

When we talk about the unfair practices and harm done by training large language models, we usually talk about it in the past tense: how they were trained on other people’s creative work without permission. But this is an ongoing problem that’s just getting worse.

The worst of the internet is continuously attacking the best of the internet. This is a distributed denial of service attack on the good parts of the World Wide Web.

If you’re using the products powered by these attacks, you’re part of the problem. Don’t pretend it’s cute to ask ChatGPT for something. Don’t pretend it’s somehow being technologically open-minded to continuously search for nails to hit with the latest “AI” hammers.

If you’re going to use generative tools powered by large language models, don’t pretend you don’t know how your sausage is made.

Friday, March 28th, 2025

Open source devs say AI crawlers dominate traffic, forcing blocks on entire countries - Ars Technica

As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.

Wednesday, March 26th, 2025

Go To Hellman: AI bots are destroying Open Access

AI companies with billions to burn are hard at work destroying the websites of libraries, archives, non-profit organizations, and scholarly publishers, anyone who is working to make quality information universally available on the internet.

Friday, March 21st, 2025

FOSS infrastructure is under attack by AI companies

More on how large language model bots are DDoSing the web:

LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.

Thursday, March 20th, 2025

Please stop externalizing your costs directly into my face

Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale.

This matches my experience with The Session. In fact, while I had this article open in a tab, I had to go deal with a tsunami of large language model bots. It’s really fucking depressing.

Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop. If blasting CO2 into the air and ruining all of our freshwater and traumatizing cheap laborers and making every sysadmin you know miserable and ripping off code and books and art at scale and ruining our fucking democracy isn’t enough for you to leave this shit alone, what is?

Tuesday, January 21st, 2025

Moving on from React, a Year Later

Many interactions are not possible without JavaScript, but that doesn’t mean we should look to write more than we have to. The server doing something useful is a requirement for building an interesting business. The client doing something is often a nice-to-have.

There’s also this:

It’s really fast

One of the arguments for a SPA is that it provides a more reactive customer experience. I think that’s mostly debunked at this point, due to the performance creep and complexity that comes in with a more complicated client-server relationship.

Saturday, October 19th, 2024

My solar-powered and self-hosted website | Dries Buytaert

This is a neat project from Dries:

This project is driven by my curiosity about making websites and web hosting more environmentally friendly, even on a small scale. It’s also a chance to explore a local-first approach: to show that hosting a personal website on your own internet connection at home can often be enough for small sites. This aligns with my commitment to both the Open Web and the IndieWeb.

At its heart, this project is about learning and contributing to a conversation on a greener, local-first future for the web.

Tuesday, May 21st, 2024

Speculation rules

There’s a new addition to the latest version of Chrome called speculation rules. This already existed before with a different syntax, but the new version makes more sense to me.

Notice that I called this an addition, not a standard. This is not a web standard, though it may become one in the future. Or it may not. It may wither on the vine and disappear (like most things that come from Google).

The gist of it is that you give the browser one or more URLs that the user is likely to navigate to. The browser can then pre-fetch or even pre-render those links, making that navigation really snappy. It’s a replacement for the abandoned link rel="prerender".

Because this is a unilateral feature, I’m not keen on shipping the code to all browsers. The old version of the API required a script element with a type value of “speculationrules”. That doesn’t do any harm to browsers that don’t support it—it’s a progressive enhancement. But unlike other progressive enhancements, this isn’t something that will just start working in those other browsers one day. I mean, it might. But until this API is an actual web standard, there’s no guarantee.
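For reference, here’s roughly what that inline version looks like—a script element containing a list of URLs to pre-render (the URLs here are made-up examples, not real pages):

<!-- Hypothetical example URLs -->
<script type="speculationrules">
{
  "prerender": [{
    "source": "list",
    "urls": ["/about", "/contact"]
  }]
}
</script>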

That’s why I was pleased to see that the new version of the API allows you to use an external JSON file with your list of rules.

I say “rules”, but they’re really more like guidelines. The browser will make its own evaluation based on bandwidth, battery life, and other factors. This feature is more like srcset than source: you give the browser some options, but ultimately you can’t force it to do anything.

I’ve implemented this over on The Session. There’s a JSON file called speculationrules.json with the simplest of suggestions:

{
  "prerender": [{
    "where": {
        "href_matches": "/*"
    },
    "eagerness": "moderate"
  }]
}

The eagerness value of “moderate” says that any link can be pre-rendered if the user hovers over it for 200 milliseconds (the nuclear option would be to use a value of “immediate”).

I still need to point to that JSON file from my HTML. Usually this would be done with something like a link element, but for this particular API, I can send a response header instead:

Speculation-Rules: "/speculationrules.json"

I like that. The response header is being sent to every browser, regardless of whether they support speculation rules or not, but at least it’s just a few bytes. Those other browsers will ignore the header—they won’t download the JSON file.

Here’s the PHP I added to send that header:

header('Speculation-Rules: "/speculationrules.json"');

There’s one extra thing I had to do. The JSON file needs to be served with a MIME type of “application/speculationrules+json”. Here’s how I set that up in the .conf file for The Session on Apache:

<IfModule mod_headers.c>
  <FilesMatch "speculationrules.json">
    Header set Content-Type application/speculationrules+json
  </FilesMatch>
</IfModule>

A bit of a faff, that.

You can see it in action on The Session. Open up Chrome or Edge (same same but different), fire up the dev tools and keep the network tab open while you navigate around the site. Notice how hovering over a link will trigger a new network request. Clicking on that link will get you that page lickety-split.

Mind you, in the case of The Session, the navigations were already really fast—performance is a feature—so it’s hard to gauge how much of a practical difference it makes in this case, but it still seems like a no-brainer to me: taking a few minutes to add this to your site is worth doing.

Oh, there’s one more thing to be aware of when you’re implementing speculation rules. You have the option of excluding URLs from being pre-fetched or pre-rendered. You might need to do this if you’ve got links for adding items to shopping carts, or logging the user out. But my advice would instead be: stop using GET requests for those actions!
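That said, if you do need that escape hatch, the rules syntax lets you combine conditions. Here’s a hypothetical sketch—the /logout URL is just an example—that pre-renders everything except that one path:

{
  "prerender": [{
    "where": {
      "and": [
        { "href_matches": "/*" },
        { "not": { "href_matches": "/logout" } }
      ]
    },
    "eagerness": "moderate"
  }]
}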

Most of the examples given for unsafe speculative loading conditions are textbook cases of when not to use links. Links are for navigating. They’re idempotent. For everything else, we’ve got forms.

Monday, April 29th, 2024

HTMX Is So Cool I Rolled My Own! – David Bushell – Freelance Web Design (UK)

Call it HTMLX or call it Hijax, what matters isn’t the code so much as the idea:

Front-end JavaScript fatigue is real. I’m guilty myself of over-engineering JS UI despite preferring good old server templates. I don’t even think HTMX is that good but the philosophy behind it embarrasses the modern JavaScript developer. For that I appreciate it very much.

Monday, February 12th, 2024

Federation syndication

I’m quite sure this is of no interest to anyone but me, but I finally managed to fix a longstanding weird issue with my website.

I realise that me telling you about a bug specific to my website is like me telling you about a dream I had last night—fascinating for me; incredibly dull for you.

For some reason, my site was being brought to its knees anytime I syndicated a note to Mastodon. I rolled up my sleeves to try to figure out what the problem could be. I was fairly certain the problem was with my code—I’m not much of a back-end programmer.

My tech stack is classic LAMP: Linux, Apache, MySQL and PHP. When I post a note, it gets saved to my database. Then I make a curl request to the Mastodon API to syndicate the post over there. That’s when my CPU starts climbing and my server gets all “bad gateway!” on me.

After spending far too long pulling apart my PHP and curl code, I had to come to the conclusion that I was doing nothing wrong there.

I started watching which processes were making the server fall over. It was MySQL. That seemed odd, because I’m not doing anything too crazy with my database reads.

Then I realised that the problem wasn’t any particular query. The problem was volume. But it only happened when I posted a note to Mastodon.

That’s when I had a lightbulb moment about how the fediverse works.

When I post a note to Mastodon, it includes a link back to the original note on my site. At this point Mastodon does its federation magic and starts spreading the post to all the instances subscribed to my account. And every single one of them follows the link back to the note on my site …all at the same time.

This isn’t a problem when I syndicate my blog posts, because I’ve got a caching mechanism in place for those. I didn’t think I’d need any caching for little ol’ notes. I was wrong.

A simple solution would be not to include the link back to the original note. But I like the reminder that what you see on Mastodon is just a copy. So now I’ve got the same caching mechanism for my notes as I do for my journal (and I did my links while I was at it). Everything is hunky-dory. I can syndicate to Mastodon with impunity.
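If you’re curious what that kind of caching can look like, here’s a hypothetical sketch in PHP—not the actual code running on my site—of serving a cached copy of a note so a burst of fediverse requests doesn’t hit the database every time:

// Hypothetical sketch, not the real implementation: serve a cached copy of a
// note's HTML so a flood of simultaneous requests doesn't hammer MySQL.
function getNoteHtml(int $noteId): string {
    $cacheFile = sys_get_temp_dir() . "/note-{$noteId}.html";

    // Reuse the cached copy if it's reasonably fresh (five minutes here).
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < 300) {
        return file_get_contents($cacheFile);
    }

    // Otherwise render from the database once and cache the result.
    $html = renderNoteFromDatabase($noteId); // hypothetical rendering function
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;
}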

See? I told you it would only be of interest to me. Although I guess there’s a lesson here. Something something caching.

Thursday, February 8th, 2024

Offloading JavaScript With Custom Properties: HeydonWorks

With classes, we can send CSS static values but with custom properties we can send dynamic ones, which is a major shift in the way we can style state. This is something that has been true for some time—and is extremely well supported—but sometimes it takes solving a small real-world problem to make you appreciate the value of it.

I think we still haven’t come to fully appreciate the superpower of custom properties: dynamic values that are shared between CSS and JavaScript.
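A small illustration of the idea—a hypothetical sketch, nothing to do with Heydon’s actual code—where JavaScript writes one dynamic value into a custom property and the CSS decides what to do with it:

<div class="progress"></div>
<style>
  /* The bar's width is driven entirely by the custom property. */
  .progress {
    height: 0.5em;
    background: green;
    width: calc(var(--scrolled, 0) * 100%);
  }
</style>
<script>
  // Hypothetical example: update the custom property as the page scrolls.
  addEventListener('scroll', () => {
    const max = document.documentElement.scrollHeight - innerHeight;
    const scrolled = max > 0 ? scrollY / max : 0;
    document.documentElement.style.setProperty('--scrolled', scrolled);
  });
</script>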

Tuesday, January 30th, 2024

HTML Web Components on the Server Are Great | Scott Jehl, Web Designer/Developer

Scott has written a perfect description of HTML web components:

They are custom elements that

  1. are not empty, and instead contain functional HTML from the start,
  2. receive some amount of progressive enhancement using the Web Components JavaScript lifecycle, and
  3. do not rely on that JavaScript to run for their basic content or functionality.
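Something like this, to give a hypothetical example (the element name and behaviour are made up, not taken from Scott’s post):

<!-- Hypothetical example: the markup is useful on its own… -->
<toggle-section>
  <h2>Details</h2>
  <p>This content is present and readable before any JavaScript runs.</p>
</toggle-section>
<script>
  // …and the custom element only enhances what's already there.
  customElements.define('toggle-section', class extends HTMLElement {
    connectedCallback() {
      const heading = this.querySelector('h2');
      const content = this.querySelector('p');
      const button = document.createElement('button');
      button.textContent = heading.textContent;
      button.setAttribute('aria-expanded', 'true');
      heading.replaceChildren(button);
      button.addEventListener('click', () => {
        content.hidden = !content.hidden;
        button.setAttribute('aria-expanded', String(!content.hidden));
      });
    }
  });
</script>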

Friday, January 26th, 2024

Concatenating text

Why the heck is everyone reaching for React as soon as something on the screen needs to update? And why do we insist on squishing our frontend concerns together with our backend concerns?

I’m glad I’m not the only one constantly asking myself those questions.

Look: is the idea of physically separating “code that runs business logic and builds markup” from “code that handles realtime interactions” really that awful? You can write both in JS if you like, I promise I won’t judge, but we can’t keep pretending that they’re basically the same.

Thursday, December 21st, 2023

Simple Web Server

This is a handy little tool for spinning up a local web server when you don’t need all the features of something like MAMP.

Friday, December 8th, 2023

scottjehl/PE: declarative data binding for HTML

This is an interesting idea from Scott—a templating language that doesn’t just replace variables with values, but keeps the original variable names in there too.

Not sure how I feel about using data- attributes for this though; as far as I know, they’re intended to be site-specific, not for cross-site solutions like this.

Friday, November 24th, 2023

CSS { In Real Life } | Choosing a Green Web Host

Earlier this month, Jeremy Keith posed the question: “How green is my server?”. As Jeremy notes, it’s surprisingly hard to get that information! So how do you ensure that you’re hosting your website on a green server?

Sunday, November 19th, 2023

How green is my server?

The Session does very well in terms of performance. You can see the data from the Chrome UX Report (CrUX).

What’s good for performance is good for the environment. Sure enough, The Session gets a very high score from the website carbon calculator:

Hurrah! This web page achieves a carbon rating of A+

This is cleaner than 99% of all web pages globally

But under the details about hosting it says:

Oh no, it looks like this web page uses bog standard energy

The Session is hosted on DigitalOcean, who tend to be quite tight-lipped about their energy suppliers. Fortunately others have done some sleuthing to figure out which facilities are running on green energy.

One of the locations to get the green thumbs up is the Amsterdam facility housed by Equinix. That’s where The Session is hosted.

I’m glad that I was able to find out that the site is running on 100% renewable energy, but I wish I didn’t need to go searching to find this out. DigitalOcean need to be a lot more transparent about the energy sources for their hosting facilities.

Sunday, December 18th, 2022

Indiekit

Paul’s indie web project is live!

Meet the little Node.js server with all the parts needed to publish content to your personal website and share it on social networks.

You can read the accompanying blog post.

Friday, November 18th, 2022

Remix and the Alternate Timeline of Web Development - Jim Nielsen’s Blog

It sounds like Remix takes a sensible approach to progressive enhancement.

Tuesday, November 15th, 2022

Mastodon is just blogs

Do you still miss Google Reader, almost a decade after it was shut down? It’s back!

A Mastodon server is a feed reader, shared by everyone who uses that server.

I really like Simon’s description of the fediverse:

A Mastodon server (often called an instance) is just a shared blog host. Kind of like putting your personal blog in a folder on a domain on shared hosting with some of your friends.

Want to go it alone? You can do that: run your own dedicated Mastodon instance on your own domain.

This is spot-on:

Mastodon is just blogs and Google Reader, skinned to look like Twitter.