
Recover Deleted Web Pages from the Internet Automatically


Being a developer is not easy, especially if you like to do things in a hurry and are as absent-minded as me. I am sure you must have edited or deleted a few files, only to realize later that the older versions are gone forever. So here is a solution that may work for live sites.

This will come in handy when you are trying to recover an accidentally deleted website, or when you need to retrieve a web page that no longer exists at its original location.

Say you open a web page on the Internet but the server hosting the site returns a 404 error, meaning the page has either been removed or moved to a different location.

To recover the lost page, your best option is to search for it across all three major search engines (Google, Yahoo, Windows Live Search) and hope that a copy of the page exists in a cache somewhere.

1. Recover your Deleted Blog Posts from the Google Cache

Simple: if the URL of the lost page is http://example.com/page-lost, go to www.google.com and type cache:http://example.com/page-lost in the search box. Google will show you the cached version of the page from its index. Since most sites nowadays get cached almost immediately, the chances of finding your page in the Google cache are very high.
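If you have many pages to check, the same lookup can be scripted. Here is a minimal Python sketch, assuming Google still serves cached copies through the webcache.googleusercontent.com endpoint (the URL pattern, and whether Google lets automated clients through, are assumptions worth verifying):

```python
import urllib.parse
import urllib.request

def fetch_google_cache(lost_url):
    """Fetch Google's cached copy of a page, if one exists.
    Raises urllib.error.HTTPError (e.g. 404) when no copy is cached."""
    cache_url = ("https://webcache.googleusercontent.com/search?q=cache:"
                 + urllib.parse.quote(lost_url, safe=""))
    request = urllib.request.Request(
        cache_url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

# Example: save the cached copy of the lost page to disk
# html = fetch_google_cache("http://example.com/page-lost")
# open("page-lost.html", "w", encoding="utf-8").write(html)
```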

PS1 – If you are a developer, I would highly advise you to start using CVS for big, long-term projects; even for smaller work, make sure all edits get into log files. In fact, that is why using Dreamweaver, Eclipse, or any such IDE is an excellent idea.

PS2 – If you are looking for very old content, say from a year back, you can look at http://www.archive.org/ and you may well find a version of the page you are looking for.
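The Wayback Machine also has a small availability API that returns the closest archived snapshot as JSON, which saves you clicking through the calendar by hand. A rough Python sketch (endpoint and field names are as documented by archive.org, but double-check before relying on them):

```python
import json
import urllib.parse
import urllib.request

def wayback_snapshot(lost_url, timestamp=None):
    """Ask the Wayback Machine for the archived snapshot of a URL
    closest to the given YYYYMMDD timestamp. Returns a URL or None."""
    api = ("https://archive.org/wayback/available?url="
           + urllib.parse.quote(lost_url, safe=""))
    if timestamp:  # e.g. "20070315" to prefer copies near that date
        api += "&timestamp=" + timestamp
    with urllib.request.urlopen(api) as response:
        data = json.load(response)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Example: find a year-old copy of the lost page
# print(wayback_snapshot("http://example.com/page-lost", "20070315"))
```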

2. Using Warrick - an Automatic Blog Recovery Tool

Instead of manually copy-pasting each and every cached article, have a look at Warrick. It is an automatic blog-recovery web application that lets you reconstruct a lost website (or a single web page) automatically. Simply type in the URL of the web site and Warrick will notify you via email once the recovery process is over. The tool is essentially a web crawler that scans and collects missing web pages from four web repositories - the Internet Archive, Google, Live Search, and Yahoo. If a web page is found in more than one repository, Warrick saves the copy with the most recent date.
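To make that last rule concrete, here is my own illustration of the "keep the most recent copy" merge step; the data structures are invented for the example and this is not Warrick's actual code or format:

```python
from datetime import datetime

# Hypothetical results: each repository reports when it cached the page.
cached_copies = [
    {"repository": "Internet Archive", "date": datetime(2007, 11, 2),  "html": "<html>...</html>"},
    {"repository": "Google",           "date": datetime(2008, 1, 15),  "html": "<html>...</html>"},
    {"repository": "Yahoo",            "date": datetime(2007, 12, 30), "html": "<html>...</html>"},
]

def pick_most_recent(copies):
    """When the same page is found in several repositories,
    keep the copy with the latest cache date."""
    return max(copies, key=lambda copy: copy["date"]) if copies else None

best = pick_most_recent(cached_copies)
print(best["repository"])  # -> Google, the most recent copy
```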

3. Using the Firefox Cache to Recover a Blog Post

If you are running a WordPress blog, you may want to read this article from WpHackr on how to use the Firefox cache to recover a blog post. Mind you, this is not the easiest way at all; it involves a lot of hassle, but it can still serve as a last resort.

4. Recover the posts from the RSS feed

This one is logical and easy. If you have been publishing your full blog posts in your RSS feed, you can go back to FeedBurner and dig through the older feeds to recover the individual blog posts. Scrape through your own posts!
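As a rough sketch, here is how you might pull the post bodies out of a feed with the third-party feedparser library (pip install feedparser). The feed URL is a placeholder, and note that a live feed usually carries only the most recent entries, so older posts may exist only in saved or archived copies of the feed:

```python
import feedparser  # third-party: pip install feedparser

# Placeholder feed URL -- substitute your own FeedBurner address.
feed = feedparser.parse("http://feeds.feedburner.com/your-blog")

for entry in feed.entries:
    # Full-content feeds carry the post body in entry.content;
    # summary-only feeds fall back to entry.summary.
    body = entry.content[0].value if "content" in entry else entry.get("summary", "")
    title = entry.get("title", "untitled")
    filename = title.replace("/", "-") + ".html"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(body)
    print("Recovered:", title)
```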

Prevention

When you FTP (upload) a file to your website, it overwrites any existing file with the same name. So before you upload a new file, it's a good idea to keep a backup of the original. If the new version of the page doesn't work as expected, you can compare it with the old version to troubleshoot, or simply put the original back.

Making Backups

It's always a good idea to download a backup of your site, perhaps once a month (more often if you make frequent changes). Just download a copy and keep it on your computer, or burn it to a disc. It will be there if you ever need it.

Before I make significant changes to a page, I save a copy under another name, perhaps incorporating the date (if it's index.html, I might save it as index030807.html or index1.html). That way I always have a copy of the page as it existed before the changes, in case I ever want to refer to it, roll back, or troubleshoot a problem.
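If you want to automate that habit, a few lines of Python will do it; this sketch uses a YYYYMMDD variant of the naming scheme above:

```python
import shutil
from datetime import date
from pathlib import Path

def dated_backup(filename):
    """Copy a file to name<YYYYMMDD>.ext before editing it,
    so the pre-change version is always one copy away."""
    source = Path(filename)
    stamp = date.today().strftime("%Y%m%d")
    backup = source.with_name(source.stem + stamp + source.suffix)
    shutil.copy2(source, backup)  # copy2 preserves timestamps
    return backup

# Example: index.html -> index20080307.html
# dated_backup("index.html")
```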

What do you think of these ideas? Do you have alternative or better ways? How often do you back up your blog?