====== Scraping: Hidden Glasgow ======

This was a great resource (the forums) in the past, but it seems to no longer be maintained. There does seem to be a mod or two around, but I've tried to create an account and it needs manual review (which hasn't come). The contact email addresses on the site bounce.

Anyway, I took a complete archive of the forums using this:

  wget --recursive --no-clobber --domains www.hiddenglasgow.com http://www.hiddenglasgow.com/forums/

This gave me an archive of around 3 GB of text (HTML). I don't have any plans to do anything with this yet, but if the site ever disappears I'll have a copy.

Basically, as I'm trying to put together a list of Glasgow-based URLs, I'll search the entire forum for URLs and see what there is.

First, combine all the posts into one huge file:

  cat viewtopic.php* >> full_text_of_the_forum.txt

This is a file around 3.3 GB in size. I'll use hxwls to find the URLs (it's part of the HTML-XML-utils package: ''apt install html-xml-utils'').

  hxwls full_text_of_the_forum.txt > full_list_of_urls.txt

That found 16.7 million URLs. I'll sort the list and remove the duplicates:

  sort full_list_of_urls.txt | uniq > url_list.txt

So there are 298,318 unique URLs. I'm not interested in the relative links (to profiles, other replies etc.), so I'll filter it down to only URLs with an http scheme:

  grep -i http url_list.txt > http_links.txt

Down to 62.5k links. I'm not interested in the images at the moment; I'll maybe deal with them later:

  grep -vi -E "(gif|jpg|png)$" http_links.txt > non-image-links.txt

22k links left. Now I'll work through them: first I'll run them through a [[Scraping/DNS filter]] script that will find which domains no longer exist and remove them, then it'll check for 404s and move those into a separate list (a rough sketch of the idea is below).

Check out the [[/Scraping]] page for more information.
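The actual script lives on the [[Scraping/DNS filter]] page; as a rough illustration of the idea, here's a minimal bash sketch. It assumes the input is non-image-links.txt and that ''host'' and ''curl'' are installed; the output filenames are just placeholders for the example.

  #!/bin/bash
  # Rough sketch: split the link list into dead-domain, 404 and live lists.
  # Output filenames are placeholders, not the real script's names.
  while read -r url; do
    # Pull the domain out of http://domain/path...
    domain=$(echo "$url" | awk -F/ '{print $3}')
    # Domain no longer resolves - park the link in its own list
    if ! host "$domain" > /dev/null 2>&1; then
      echo "$url" >> dead_domain_links.txt
      continue
    fi
    # Domain is alive - fetch the HTTP status and separate out the 404s
    status=$(curl -o /dev/null -s -L --max-time 10 -w "%{http_code}" "$url")
    if [ "$status" = "404" ]; then
      echo "$url" >> 404_links.txt
    else
      echo "$url" >> live_links.txt
    fi
  done < non-image-links.txt

It checks one link at a time, so over 22k links it'll be slow; splitting the input and running a few copies in parallel (or feeding it through ''xargs -P'') would speed it up.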