Web scraping directly from the shell
Even before I was in the business of coding an RSS reader and using it all day long, I was looking for an efficient way to know the time and movement of the ocean tide near me, and an RSS feed seemed like a good idea.

I even (not really sure) seem to remember such a thing, but if it ever existed it’s down now anyway; and for a reason: it’s no small feat to serve potentially thouzillions of RSS feeds, one per seaside location.
So I used websites. Get to the browser, mistype the URL, that sort of thing. But we’re programmers; we only do anything thrice before we write a script to do it in our stead.
Thing is, web scraping is usually the workhorse of big frameworks. I should know, being the author of Feedrat and Favrat, which respectively hunt down and report RSS feeds and favicons in web pages, using only Node.js and a boatload of dependencies.
Recently I noticed that I’d been using the same site for months, and that it was really stable. So I decided to search for “web scraping from the shell” and found `hxnormalize` and `hxselect`; those two commands are available in the `html-xml-utils` package.
With these two commands, we can:
- read HTML (specifically, get a representation of the DOM of the page, no longer a simple string of characters);
- select any element in that DOM.
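Before going online, we can sanity-check the pair on a made-up snippet (this assumes the `html-xml-utils` package is installed; the HTML string below is invented for the demo):

```shell
# Invented HTML snippet, just to exercise the two commands locally:
# hxnormalize -x turns tag soup into well-formed XML, and hxselect
# then matches elements against a CSS selector.
printf '<html><body><p class="lead">Marée haute à 14:32</p></body></html>' \
  | hxnormalize -x \
  | hxselect -i "p.lead"
```

No network involved: the same pipeline will work identically on a page fetched with `wget`.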
So we only need a vehicle, a vessel to deliver said web page to us; with those two commands we can then do something meaningful with it. This vessel will be `wget`. Or `curl`. Yeah, good ol’ `wget` will do.
```shell
apt install html-xml-utils lynx wget
```
Yes, we’ll need `lynx` later, but you ought to have those commands installed at all times one way or another, or I don’t know what’s wrong with you 😉
Here is the website that I used to get the next tide times of any location on Earth:
Or just press F12 and seek the bastard
But now we want this data available to us directly on the desktop; how do we do that? Right-click on the element that interests you and select Inspect; if you aimed well, you’ll get the same result as in the image above.
Take note of the name (actually the type and the class) of the element, in our case `p.lead`, and you’re on your way 😎
Keyboard time
Let’s test this: enter the following command in the console (`zsh` and `bash` both work). You do have `wget` installed, right?
```shell
echo "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead"
```
And here is the output:

```shell
--2022-10-13 20:03:06--  https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html
```
Bingo! In this small string are the two pieces of information we need:
- the current state of the tide (in our case “haute”, i.e. high);
- the date and hour of the next tide.
All that is left now is to get rid of the HTML formatting, which only a green noob would attempt by hand: parsing HTML (or any markup language, for that matter) is a pain whose memory never really goes away; ask any poor soul who ever attempted it.
No, what we are going to do is politely ask `lynx` (you do have `lynx` installed), whose job it is to do just that (parse HTML), to output a clean, unformatted string:
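To see what `lynx` does in isolation, here is a small stand-alone demo on an invented snippet (assuming `lynx` is installed):

```shell
# -stdin reads the document from standard input, -dump renders it as
# plain text to standard output: all tags are gone from the result.
printf '<p class="lead">Marée <b>haute</b> à 14:32</p>' | lynx -stdin -dump
```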
```shell
echo "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead" | lynx -stdin -dump
```
Output:

```shell
FINISHED --2022-10-13 20:10:15--
```
Oh, shoot, there is a markup element in our way. Look at the image above again. This dang `<img src="/site/images/temps.png">` sits right there, in the very div; there’s no way around it. Oh well, we’re hackers, let’s just get rid of it in the received string using good old `sed`:
```shell
echo "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead" | lynx -stdin -dump | sed '1d'
```
Output:

```shell
FINISHED --2022-10-13 20:20:26--
```
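In case the `sed '1d'` bit looks cryptic: it simply deletes the first line of its input. A self-contained illustration, with two made-up lines mimicking the lynx output:

```shell
# Line 1 mimics the stray image placeholder, line 2 the tide info we want.
printf '[temps.png]\nMarée haute à 14:32\n' | sed '1d'
# → Marée haute à 14:32
```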
Beach time 😎
You can do what you want with the data gathered; here is how I put it together to get the next tide time and movement right in my i3status bar, visible at a glance. Just pass the URL as the first argument.
```shell
#!/usr/bin/env bash
```
In i3
```
set $next_tide_script ~/.scripts/px-i3-next-tide-time.sh [url]
```
By that point the script will run at each i3 start or restart; if, like me, you want it to run every hour no matter what, just use something like this (with your own `${MYSCRIPTDIR}`, of course):
```shell
(crontab -l 2>/dev/null; echo "0 * * * * ${MYSCRIPTDIR}/px-i3-next-tide-time.sh") | crontab -
```
Yes, that’s a crontab entry added directly from the shell, without having to go through `crontab -e` or other nonsense; yes, I know, you read it here first, thank you very much.
```
order += "read_file next_tide_time"
```
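For that `order` line to do anything, i3status also needs the matching `read_file` block somewhere in the same config; something along these lines, where the path is an assumption that must match wherever your script writes its output:

```
read_file next_tide_time {
    format = "%content"
    path = "~/.cache/next_tide_time"
}
```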
That’s it, see you next time, keep it real. Oh, and beware of the urchins 🐚