Website capture and comparison

Tags: Content management applications, Offline Backup, Website Testing
We have an interesting project: we would like to be able to capture a website for offline viewing (I realise there are many apps that can do this), but we would also like to be able to re-capture the content at a later date and have an automatic process compare the initial capture with the re-capture and produce a report of the changes (if any). I've been asking around, and it seems this might be a difficult thing to do. Any ideas would be appreciated.
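
For concreteness, one rough way to wire this up is to mirror the site into a dated folder on each capture and then run a recursive diff between two captures. This is only a sketch, assuming the wget and diff command-line tools are installed; the URL and folder names are placeholders:

    import subprocess
    from datetime import date

    SITE = "http://example.com"             # placeholder URL
    snapshot = "capture-%s" % date.today()  # e.g. capture-2010-05-01

    # Mirror the site into a dated folder; wget follows links recursively
    # and rewrites them so the copy can be browsed offline.
    subprocess.run(["wget", "--mirror", "--convert-links", "--no-parent",
                    "-P", snapshot, SITE], check=True)

    # After a second capture on a later date, compare the two folders
    # recursively and save the differences as a change report.
    result = subprocess.run(["diff", "-r",
                             "capture-2010-05-01", "capture-2010-06-01"],
                            capture_output=True, text=True)
    with open("changes.txt", "w") as report:
        report.write(result.stdout)

Anything diff flags would then feed the change report; the volatile-content problem discussed in the answer below still applies.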

Answer Wiki


In that case it is a subject I can support. Unwanted web content is something that bugs me personally, and professionally in the workplace.

To my knowledge, there is no easy way of doing this. I have used the offline browsing capability of my browser to automatically re-download a single page if it changed after my last visit, but your task is far more complex: I do not cross-check pages in the quantity you will be doing.

The functions that spring to mind call for serious applications: downloading the data, saving it, and comparison checking, as I am sure you are aware in your quest for a solution.

For instance, some pages of my website have dates and times associated with them, e.g. today's date, or the date and time the user last visited, which will have changed each time the page is viewed. So that is a change that will have to be taken into consideration when automatically checking the saved pages for differences.
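
That kind of per-visit noise can be masked before the comparison step. A minimal sketch, assuming the volatile fragments can be matched with regular expressions (the patterns here are made-up examples, not taken from any real site):

    import re

    # Illustrative patterns for content that changes on every visit;
    # these are guesses and would need tuning for the real site.
    VOLATILE = [
        re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),    # dates like 01/05/2010
        re.compile(r"\d{1,2}:\d{2}(:\d{2})?"),   # times like 14:30:05
        re.compile(r"[Ll]ast visited[^<]*"),     # "last visited ..." text
    ]

    def normalise(html):
        """Mask volatile fragments so they do not show up as changes."""
        for pattern in VOLATILE:
            html = pattern.sub("[MASKED]", html)
        return html

Running both captures through normalise() before diffing means only the remaining differences count as real changes.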

I use a web update function in Excel to download a page of the latest lottery results each time I open the workbook; macros format, copy and sort the data into the respective sheets. Maybe Word or Publisher could be used in the same way, driven by VBA.

This is out of my league, but my train of thought runs along the lines of downloading web pages into Publisher or Word on a regular basis, and then comparing the changes in them using file-comparison software. If it is only one or two errant pages you need to check, scripting may be just as good.
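
For the one-or-two-page case, the script could be as small as storing a checksum per page and re-fetching on a schedule. A rough sketch (the URLs are placeholders):

    import hashlib, json, os, urllib.request

    PAGES = ["http://example.com/page1.html",
             "http://example.com/page2.html"]   # placeholder URLs
    STORE = "digests.json"

    # Load the digests saved on the previous run, if any.
    old = json.load(open(STORE)) if os.path.exists(STORE) else {}

    new = {}
    for url in PAGES:
        html = urllib.request.urlopen(url).read()
        new[url] = hashlib.sha256(html).hexdigest()
        if url in old and old[url] != new[url]:
            print("CHANGED:", url)

    # Save the current digests for the next run.
    with open(STORE, "w") as f:
        json.dump(new, f)

A changed digest only says that something changed; deciding whether it is a valid content change still needs the human step mentioned below.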

Maybe someone with better scripting skills could come up with an automated script to do this. It does, however, raise the point that at some time a human will have to look at the text of the page and decide whether the change is a valid content change, or just a change like you would get from my site (last date visited, etc.).

Good luck with this project, let me know if baldness sets in.

Discuss This Question: 5 Replies

 
  • Boardpig
    Nobody has any ideas on this?
  • carlosdl
    Any ideas about what, specifically? I have never done something like that, but if you know how to download the site's files, you could then open each file and compare it to the newer version, line by line, or with the help of some other tool like the Linux diff command (a sketch of this line-by-line approach appears after these replies).
  • Chippy088
    Most web browsers can be set to download a page again if it changes; maybe that is the way to go. The content of pages for public viewing is not sensitive, or it wouldn't be there. You can view the code in some web browsers, but there are private files, held on the ISP's server, which you will not have access to.

    The problem with downloading a site is that if you are trying to copy it over the internet, you will not get all the files on the site unless you are an admin of the site. In that case you wouldn't be asking this question, so I assume you are not. And if the site collects data from visitors, for whatever reason, it would be breaking the Data Protection Act if you could see this type of information.

    As to why you want to do this, or what information you want to compare, you don't say. Therefore, I think I would be very cautious in the help given.
  • Boardpig
    Hello. Yeah, I understand this could be sensitive. The organisation I am working with are in the business of tracking and classifying unsuitable material on websites, and are looking for an automated way of re-checking the websites after classification (without a human going back to look at them). I.e., once the website has been classified, the hoster may make changes to the site to alter the classification. They are looking for an automated way of detecting the change.
  • Chippy088
    I have just noticed I entered my reply in the wiki section (I just started typing where my mug shot was). I do not mean to imply this is the only answer. In fact, I hope someone does come up with a more informed answer, as it has aroused my interest too.
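
Following up on carlosdl's line-by-line suggestion above: Python's difflib can produce a diff-style report without relying on the Linux diff command. A minimal sketch, with placeholder paths standing in for two captures of the same page:

    import difflib

    # Paths are placeholders for two captures of the same page.
    with open("capture-old/index.html") as f:
        old_lines = f.readlines()
    with open("capture-new/index.html") as f:
        new_lines = f.readlines()

    report = difflib.unified_diff(old_lines, new_lines,
                                  fromfile="old capture",
                                  tofile="new capture")
    print("".join(report))

The unified-diff output shows only the changed lines with a little context, which is close to the "report of the changes" the original question asks for.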
