HAPPY NEW YEAR!!
Sooooo I was going to scrape with Scrapy - this was a million times more painful than i imagined!!!
likely because i had no clue what i was doing. lol
i had a tutorial from class (very similar to [this one](http://mherman.org/blog/2012/11/05/scraping-web-pages-with-scrapy/)) which i used to set up scrapy the proper way - but honestly, if you follow the commands (to build and run your web spiders) on the front page of scrapy at https://scrapy.org/ you'll have your spider up and running in less than a minute with the example!
i should have tried this earlier :(
the tutorials on the herman website and of course on scrapy https://doc.scrapy.org/en/0.16/intro/tutorial.html teach you the fundamentals which are extremely useful, but hey, you gotta make things work first before you delve into the details, right? …right?
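to see the moving parts without installing anything: the core of what a spider's parse callback does - pull titles and links out of a page - can be sketched with just the python stdlib (the HTML snippet and class name below are made up for illustration; scrapy itself gives you css/xpath selectors for this instead):

```python
from html.parser import HTMLParser

# stdlib-only sketch of the "extract" step a spider's parse() does:
# collect every <a href> and every <h2> title from a page
class LinkTitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.titles = [], []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())

# made-up page content standing in for a fetched response body
page = '<h2>First post</h2><a href="/page/2">next</a>'
parser = LinkTitleParser()
parser.feed(page)
# parser.titles → ["First post"], parser.links → ["/page/2"]
```

scrapy wraps the fetching, scheduling, and following of those links for you - which is exactly the part that's painful to hand-roll.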
one thing to note - remember to wrap your scraping code in try/except to catch errors! maybe only a noob like me forgets… you don’t want your process to stop a few minutes after you go to sleep…
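here's a minimal sketch of that pattern (the parsing function and data are made up): catch per-item errors, log them, and keep going, instead of letting one bad row kill the whole overnight run:

```python
# hypothetical per-item parser - any row can be malformed
def parse_price(raw):
    return float(raw.strip().lstrip("$"))

def parse_rows(rows):
    results, errors = [], []
    for row in rows:
        try:
            results.append(parse_price(row))
        except (ValueError, AttributeError) as exc:
            # log the bad row and move on - don't crash the crawl
            errors.append((row, exc))
    return results, errors

prices, failed = parse_rows(["$3.50", None, "n/a", "$10"])
# prices → [3.5, 10.0]; failed has the two bad rows
```

in a real spider you'd do the same thing inside the parse callback, so one weird page doesn't take down the whole process.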
in other news - my processes stop running after my ssh session ends! faints.
obviously, i did not realise this and wasted a night yesterday… i shut down my computer, figuring that since this is a cloud server, everything would keep running, right? WRONG. turns out the server itself does keep running - it’s the processes tied to my ssh session that get killed when the session closes.
ok so how do i keep this thing going?? apparently there’s this thing called tmux (haven’t tried it yet) that lets your programs survive a disconnect.
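for when i do try it - the basic tmux workflow looks like this (the session name and crawl command below are made up; tmux needs to be installed on the server):

```shell
# start a named session on the server
tmux new -s scraper

# inside the session, start the long-running job, e.g.:
#   scrapy crawl myspider        # hypothetical spider name

# detach without killing it: press Ctrl-b, then d
# now you can close your laptop - the job keeps running on the server

# later, from a fresh ssh session, reattach and check on it:
tmux attach -t scraper
```

the idea is that the process belongs to the tmux session on the server, not to your ssh connection, so disconnecting no longer kills it.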