Web scraping directly from the shell

Even before I was in the business of coding an RSS reader and using it all day long, I was looking for an efficient way to know the time and movement of the ocean tide near me, and an RSS feed seemed like a good idea.

I even seem to remember (not really sure) that such a feed existed, but if it ever did, it’s down now anyway; and for a reason: it’s no small feat to serve potentially thouzillions of RSS feeds, one per seaside location.
So I used web sites. Get to the browser, mistype the URL, that sort of thing. But I mean, we’re programmers: we only do anything thrice before we write a script to do it in our stead.

Thing is, web scraping is usually the job of big frameworks. I should know, being the author of Feedrat and Favrat, which hunt, find and report RSS feeds and favicons, respectively, in web pages, using only Node.js and a boatload of dependencies.

Recently I noticed that I’d been using the same site for months, and that it was really stable. So I decided to search for “web scraping from the shell” and found hxnormalize and hxselect; those two commands are available in the html-xml-utils package.

With these two commands, we can

  • Read HTML (specifically, turn the page into well-formed markup that can be queried like a DOM, instead of treating it as a simple string of characters)
  • Select any element in the DOM
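
To get a feel for the pair without touching the network, here is a minimal sketch on a made-up HTML fragment (the markup, its text and its class name are invented for illustration). hxnormalize -x rewrites the input as well-formed XML, and hxselect -c prints the content of whatever matches the CSS selector, without the enclosing tag.

```shell
#!/usr/bin/env bash
# A made-up page fragment, only for illustration.
html='<html><body><p class="lead">next tide at <b>22:10</b></p><p>something else</p></body></html>'

# hxnormalize -x: turn the input into well-formed XML.
# hxselect -c:    keep only the content of the elements matching the selector.
selected=$(printf '%s\n' "$html" | hxnormalize -x | hxselect -c 'p.lead')

printf '%s\n' "$selected"
```

Only the first paragraph should come out; the second one does not carry the class, so the selector ignores it.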

So we only need a vehicle, a vessel to bring said web page to us, and with those two commands we can do something meaningful with it. This vessel will be wget. Or curl. Yeah, good ol’ wget’ll do.

Install the packages needed for this tutorial (Debian; adapt to your distro):
apt install html-xml-utils lynx wget

Yes, we’ll need lynx later, but anyway you’ve got to have those commands installed at all times, one way or another, or I don’t know what’s wrong with you 😉

Here is the website that I used to get the next tide times of any earth location:

Or just press F12 and seek the bastard

But now we want this data available directly on the desktop. How do we do that? Right-click on the element that interests you and select Inspect; if you aimed well you’ll get the same result as in the image above.

Take note of the name (actually the type and the class) of the element, in our case p.lead, and you’re on your way 😎

Keyboard time

Let’s test this: enter the following command in the console (zsh and bash both work). You do have wget installed, right?

echo "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead"

And here is the output

--2022-10-13 20:03:06--  https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html
Resolving www.cabaigne.net (www.cabaigne.net)... 000.00.00.000, 000.00.0.000, 000.00.0.000, ...
Connecting to www.cabaigne.net (www.cabaigne.net)|172.67.71.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

- [ <=> ] 48.02K 21.7KB/s in 2.2s

2022-10-13 20:03:09 (21.7 KB/s) - written to stdout [49170]

FINISHED --2022-10-13 20:03:09--
Total wall clock time: 4.0s
Downloaded: 1 files, 48K in 2.2s (21.7 KB/s)
<p class="lead"><img src="/site/images/temps.png"></img> <span class="label label-danger">marée
haute</span> Le jeudi 13 octobre 2022 à <b>22:10</b></p>%
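
All that chatter before the <p> line is wget’s own log. It goes to stderr, so it never enters the pipe; it only clutters the terminal. If it bothers you, wget’s -q flag silences it. A small sketch, where fetch_lead is a hypothetical helper name of mine and not something the tools provide:

```shell
#!/usr/bin/env bash
# fetch_lead URL: print the p.lead element of the page, without wget's log.
# -q silences wget, -O- sends the downloaded page to stdout.
fetch_lead() {
    wget -q -O- "$1" | hxnormalize -x | hxselect -i "p.lead"
}

# Usage (needs network access, so it is left commented out here):
# fetch_lead "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html"
```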

Bingo! This small string holds the two pieces of information we need:

  • The current state of the tide (in our case “haute”, i.e. high)
  • The date and time of the next tide

All that is left now is to get rid of the HTML formatting, something only a green noob would attempt by hand: parsing HTML (or any markup language, for that matter) is a pain whose memory never really goes away; ask any poor soul who ever attempted it.

No, what we are going to do is politely ask Lynx (you do have lynx installed), whose job is precisely to parse HTML, to give us a clean, unformatted string:

echo "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead" | lynx -stdin -dump

Output

FINISHED --2022-10-13 20:10:15--
Total wall clock time: 0.7s
Downloaded: 1 files, 48K in 0.2s (219 KB/s)
[temps.png]
marée haute Le jeudi 13 octobre 2022 à 22:10

Oh, shoot, there is a markup element in our way. Look at the above image again. No getting around it, this dang

<img src="/site/images/temps.png">

is right there, inside the very paragraph we selected. Lynx renders it as the [temps.png] line at the top of its output. Oh well, we’re hackers, let’s just delete that first line from the received string using good old sed:

echo "https://www.cabaigne.net/france/bretagne/saint-quay-portrieux/horaire-marees.html" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead" | lynx -stdin -dump | sed '1d'

Output

FINISHED --2022-10-13 20:20:26--
Total wall clock time: 1.7s
Downloaded: 1 files, 48K in 0.2s (194 KB/s)
marée haute Le jeudi 13 octobre 2022 à 22:10
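
sed '1d' works because the placeholder happens to be the first line. If you’d rather not depend on its position, a slightly more defensive variant is to delete any line that holds nothing but a [bracketed] name. This is only a sketch: it assumes lynx keeps rendering the image the way it does in the output above, and strip_placeholders is a name I made up.

```shell
#!/usr/bin/env bash
# What lynx handed us above, as a fixed sample (no network needed).
dump='[temps.png]
marée haute Le jeudi 13 octobre 2022 à 22:10'

# Delete every line made only of a [bracketed] placeholder,
# with optional spaces around it, wherever it appears.
strip_placeholders() {
    sed '/^[[:space:]]*\[[^]]*][[:space:]]*$/d'
}

printf '%s\n' "$dump" | strip_placeholders
# prints: marée haute Le jeudi 13 octobre 2022 à 22:10
```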

Beach time 😎

You can do what you want with the data gathered. Here is how I put it together to get the next tide time and movement right in my i3status bar, visible at a glance; just pass the URL as the first argument.

~/.scripts/px-i3-next-tide-time.sh [url]
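#!/usr/bin/env bash

# Fetch the page given as first argument and reduce it to a one-line phrase,
# e.g. "marée haute Le jeudi 13 octobre 2022 à 22:10"
PHRASE=$(echo "$1" | wget -O- -i- | hxnormalize -x | hxselect -i "p.lead" | lynx -stdin -dump | sed '1d')

# Field 2 is the tide state ("haute" or "basse"), field 9 is the time.
# $PHRASE is unquoted, so the shell collapses newlines and runs of spaces.
MOVE=$(echo $PHRASE | awk '{print $2}')
TIME=$(echo $PHRASE | awk '{print $9}')

# Pick an icon depending on the tide state
[[ $MOVE = "basse" ]] && ICON="" || ICON=""

# i3status picks this file up through its read_file module
echo "$TIME $ICON" > /tmp/.next_tide_time.txt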
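
In case the two awk calls look like magic numbers, here is the field counting spelled out on the phrase we got earlier. It is a sketch on a fixed sample string; the variable names are mine, and $NF (awk’s “last field”) is an alternative I’m suggesting, not what the script uses.

```shell
#!/usr/bin/env bash
phrase='marée haute Le jeudi 13 octobre 2022 à 22:10'

# awk splits on whitespace: $2 is the tide state, $9 the time.
move=$(printf '%s\n' "$phrase" | awk '{print $2}')
tide_time=$(printf '%s\n' "$phrase" | awk '{print $9}')

# $NF is the last field, whatever its position; here it is the same as $9.
last=$(printf '%s\n' "$phrase" | awk '{print $NF}')

printf '%s %s %s\n' "$move" "$tide_time" "$last"
# prints: haute 22:10 22:10
```

Counting from the end would keep working if the site ever added or dropped a word before the time, but that is a guess about a page I don’t control.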

In i3

~/.config/i3/config
set $next_tide_script ~/.scripts/px-i3-next-tide-time.sh [url]
exec_always --no-startup-id $next_tide_script

At this point the script will run at each i3 start or restart. If, like me, you want it to run every hour no matter what, just use something like this (with your own ${MYSCRIPTDIR} and URL, of course):

(crontab -l 2>/dev/null; echo "0 * * * * ${MYSCRIPTDIR}/px-i3-next-tide-time.sh [url]") | crontab -

Yes, that’s a crontab entry added directly from the shell, without having to go through crontab -e or other nonsense; yes, I know you read it here first, thank you very much.
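
One caveat with that one-liner: run it twice and you get the entry twice. A possible guard is to filter out any existing copy of the line before appending it. This is a sketch under my own naming: with_cron_line is a hypothetical helper, and the path in entry is a placeholder.

```shell
#!/usr/bin/env bash
# with_cron_line LINE: read a crontab on stdin and print it back
# with exactly one copy of LINE, placed at the end.
with_cron_line() {
    grep -vxF -- "$1"      # drop existing exact copies (-x whole line, -F literal)
    printf '%s\n' "$1"     # then append the line once
}

entry='0 * * * * /path/to/px-i3-next-tide-time.sh'   # placeholder path

# Demo on a fake crontab that already contains the entry:
printf '%s\n' '5 4 * * * /bin/true' "$entry" | with_cron_line "$entry"

# Real use (modifies your crontab, hence commented out):
# crontab -l 2>/dev/null | with_cron_line "$entry" | crontab -
```

The demo prints the /bin/true line followed by a single copy of the entry, however many times you pipe the result back through the function.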

~/.config/i3/i3status.conf
order += "read_file next_tide_time"

read_file next_tide_time {
format = "<span foreground='#ffffff'></span> <span foreground='#3daee9'> %content</span>"
format_bad = ""
path = "/tmp/.next_tide_time.txt"
}

That’s it, see you next time, keep it real. Oh, and beware of the urchins 🐚