====== Scraping: Hidden Glasgow ======

This was a great resource (the forums) in the past, but it seems to no longer be maintained. There does seem to be a mod or two around, but I've tried to create an account and it needs manual review (which hasn't come). The contact email addresses on the site bounce.

Anyway, I took a complete archive of the forums using this:

  wget --recursive --no-clobber --domains www.hiddenglasgow.com http://www.hiddenglasgow.com/forums/

This gave me an archive of around 3 GB of text (HTML). I don't have any plans to do anything with this yet, but if the site ever disappears I'll have a copy.

Basically, as I'm trying to put together a list of Glasgow-based URLs, I'll search the entire forum for URLs and see what there is.

First, combine all the posts into one huge file:

  cat viewtopic.php* >> full_text_of_the_forum.txt

This is a file around 3.3 GB in size. I'll use hxwls to find the URLs (it's part of the HTML-XML-utils package: ''apt install html-xml-utils'').

  hxwls full_text_of_the_forum.txt > full_list_of_urls.txt

That found 16.7 million URLs. I'll sort the list and remove the duplicates:

  sort full_list_of_urls.txt | uniq > url_list.txt

So there are 298,318 unique URLs. I'm not interested in the relative links (to profiles, other replies etc.), so I'll filter it down to only URLs with an http scheme:

  grep -i http url_list.txt > http_links.txt

Down to 62.5k links. I'm not interested in the images at the moment; I'll maybe deal with them later:

  grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt

22k links left. Now I'll work through them: first I'll run them through a [[Scraping/DNS filter]] script that will find which domains no longer exist and remove them, then it'll check for 404s and move those into a separate list (a rough sketch of the idea is below).

Check out the [[/Scraping]] page for more information.
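The actual script lives on the [[Scraping/DNS filter]] page; as a rough illustration of the idea, here's a minimal bash sketch. It assumes the input is non-image-links.txt and that ''host'' and ''curl'' are installed; the output filenames are just placeholders for the example.

  #!/bin/bash
  # Rough sketch: split the link list into dead-domain, 404 and live lists.
  # Output filenames are placeholders, not the real script's names.
  while read -r url; do
    # Pull the domain out of http://domain/path...
    domain=$(echo "$url" | awk -F/ '{print $3}')
    # Domain no longer resolves - park the link in its own list
    if ! host "$domain" > /dev/null 2>&1; then
      echo "$url" >> dead_domain_links.txt
      continue
    fi
    # Domain is alive - fetch the HTTP status and separate out the 404s
    status=$(curl -o /dev/null -s -L --max-time 10 -w "%{http_code}" "$url")
    if [ "$status" = "404" ]; then
      echo "$url" >> 404_links.txt
    else
      echo "$url" >> live_links.txt
    fi
  done < non-image-links.txt

It checks one link at a time, so over 22k links it'll be slow; splitting the input and running a few copies in parallel (or feeding it through ''xargs -P'') would speed it up.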