Scraping: Hidden Glasgow

The Hidden Glasgow forums were a great resource in the past, but the site no longer seems to be maintained. There does seem to be a mod or two around, but I tried to create an account and it needs manual review (which hasn't come). The contact email addresses on the site bounce.

Anyway, I took a complete archive of the forums using this:

wget --recursive --no-clobber --domains www.hiddenglasgow.com http://www.hiddenglasgow.com/forums/

This gave me an archive of around 3GB of text (HTML). I don't have any plans to do anything with this yet, but if the site ever disappears I'll have a copy.

As I'm trying to put together a list of Glasgow-based URLs, I'll search the entire forum for URLs and see what there is.

First, combine all the posts into one huge file:

cat viewtopic.php* >> full_text_of_the_forum.txt
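With this many saved pages, a more defensive version of that step could look something like the sketch below. It assumes GNU find, sort and xargs (as on Debian/Ubuntu), and combine_posts is just a name made up for the example. It truncates the output file instead of appending to it, so running it twice doesn't double the content, and it hands the file names to cat in batches rather than as one enormous argument list.

```sh
# Concatenate every saved viewtopic.php* page in a directory into one file.
# combine_posts is a hypothetical helper, not part of the original workflow.
combine_posts() {
    # $1 = directory holding the saved pages, $2 = output file
    # '>' truncates the output first, so a re-run starts from an empty file.
    # -print0 / -z / -0 keep odd file names (spaces, '?', '&') intact.
    # -r stops xargs from running cat at all if nothing matched.
    find "$1" -maxdepth 1 -type f -name 'viewtopic.php*' -print0 \
        | sort -z \
        | xargs -0 -r cat > "$2"
}

# Example call (paths are illustrative):
# combine_posts www.hiddenglasgow.com/forums full_text_of_the_forum.txt
```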

This is a file around 3.3GB in size. I'll use hxwls to find the URLs (it's part of the html-xml-utils package: apt install html-xml-utils).

hxwls full_text_of_the_forum.txt > full_list_of_urls.txt

That found 16.7 million URLs. I'll sort it and remove the duplicates:

sort full_list_of_urls.txt | uniq > url_list.txt
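The same de-duplication can be done in one step with sort -u, and wc -l gives the count of what's left. This is only a sketch: dedupe_urls is a made-up helper name, and setting LC_ALL=C is my own choice here to get plain byte-wise ordering (which also tends to be quicker on a list this size).

```sh
# De-duplicate a list of URLs and print how many unique lines remain.
# dedupe_urls is a hypothetical helper, not part of the original workflow.
dedupe_urls() {
    # $1 = input list, $2 = output list
    # sort -u sorts and drops duplicate lines in a single pass.
    LC_ALL=C sort -u "$1" > "$2"
    # Print the number of unique lines written.
    wc -l < "$2"
}

# Example call:
# dedupe_urls full_list_of_urls.txt url_list.txt
```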

So there are 298,318 unique URLs. I'm not interested in the relative links (to profiles, other replies, etc.), so I'll filter it down to only URLs with a scheme (http):

grep -i http url_list.txt > http_links.txt
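One thing to be aware of: that grep keeps any line that contains "http" anywhere, so a relative link with a URL buried in its query string would still get through. A stricter variant, sketched below, only keeps lines that start with http:// or https://. The function name absolute_links is made up for the example.

```sh
# Keep only links that begin with an http or https scheme.
# absolute_links is a hypothetical helper, not part of the original workflow.
absolute_links() {
    # $1 = list of URLs, one per line
    # ^ anchors the match to the start of the line; -i ignores case.
    grep -iE '^https?://' "$1"
}

# Example call:
# absolute_links url_list.txt > http_links.txt
```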

Down to 62.5k links. I'm not interested in the images at the moment (I may deal with them later):

grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt
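That pattern drops any link ending in gif, jpg or png, with or without a dot in front, and it lets through image URLs that end in .jpeg or have a query string after the extension. A slightly tighter filter might look like the sketch below; drop_image_links is a made-up name, and the list of extensions is just the three above plus jpeg.

```sh
# Remove links that point at common image files.
# drop_image_links is a hypothetical helper, not part of the original workflow.
drop_image_links() {
    # $1 = list of URLs, one per line
    # \. requires a real file extension; ([?#].*)? also catches
    # links like pic.jpg?size=large or pic.png#top.
    grep -viE '\.(gif|jpe?g|png)([?#].*)?$' "$1"
}

# Example call:
# drop_image_links http_links.txt > non-image-links.txt
```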

22k links left. Now I'll work through them. First I'll run them through a DNS filter script that finds which domains no longer exist and removes them. Then it'll search for 404s and move those into a separate list.
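The script itself isn't shown here, so the following is only a rough sketch of how the two passes could be done, not the actual script. It assumes getent (glibc) for the DNS lookup and curl for the status check; the function names and the three output files are all made up for the example. It sends HEAD requests, which some servers reject, and it checks links one at a time, so it would be slow on 22k URLs.

```sh
# Pull the host name out of a URL: drop the scheme, path, userinfo and port.
url_host() {
    printf '%s\n' "$1" \
        | sed -E 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||; s|[/?#].*$||; s|^.*@||; s|:[0-9]+$||' \
        | tr 'A-Z' 'a-z'
}

# Succeeds if the host name currently resolves.
host_resolves() {
    getent hosts "$1" > /dev/null 2>&1
}

# Print the final HTTP status code for a URL (000 if the request failed).
http_status() {
    curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$1"
}

# Split a list of URLs into dead domains, 404s and everything else.
sort_links() {
    # $1 = input list, $2 = dead-domain list, $3 = 404 list, $4 = remaining links
    : > "$2"; : > "$3"; : > "$4"
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        host=$(url_host "$url")
        if ! host_resolves "$host"; then
            printf '%s\n' "$url" >> "$2"
        elif [ "$(http_status "$url")" = "404" ]; then
            printf '%s\n' "$url" >> "$3"
        else
            printf '%s\n' "$url" >> "$4"
        fi
    done < "$1"
}

# Example call (output file names are illustrative):
# sort_links non-image-links.txt dead_domains.txt not_found.txt live_links.txt
```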

Check out the Scraping page for more information.

The Open Guide to Glasgow
