Developing a Web application

Tags: Development, PHP, Web development tools
I'm passing along a question posed by a reader of SearchEnterpriseLinux.com. Could someone offer some advice for him? Question: I am creating a web application with a sign-in form. The fields are: website URL and e-mail address. When the user submits the information, the application first checks whether that domain is registered. If the domain is not registered, it generates the error message "Your website is not registered." If the domain is registered, my site then searches that website for a particular word. My questions: Is it possible for my site to find out how many pages the website the user provides contains, and can the site scan all of them? How would I set that up?
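As a rough illustration of the domain-check step described above, assuming PHP (per the question's tags) and a POSTed form, something like the following could work. The field names are invented, and a DNS lookup is only a practical stand-in for "registered"; a strict check would need a WHOIS query.

<?php
// Sketch of the domain-check step. Field names are assumptions.
$url   = isset($_POST['website_url']) ? trim($_POST['website_url']) : '';
$email = isset($_POST['email'])       ? trim($_POST['email'])       : '';

// Pull the host name out of whatever the user typed.
$host = parse_url($url, PHP_URL_HOST);
if ($host === null || $host === false) {
    $host = $url;                      // the user may have typed a bare domain
}

// Treat a domain with an A, AAAA or MX record as "registered".
$registered = checkdnsrr($host, 'A')
           || checkdnsrr($host, 'AAAA')
           || checkdnsrr($host, 'MX');

if (!$registered) {
    die('Your website is not registered.');
}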

Answer Wiki


In answer to your question: yes, you can, but it is not the easiest thing to accomplish. You will need to “spider” the web site, retrieving its pages by following the links within each page. Some rudimentary spider packages have been available on freeware and open-source sites in the past. I’m not sure where I saw them, but if you google for “spider freeware” and “spider open source” you should get a few hits.
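If you would rather roll your own than hunt for a package, a bare-bones spider can also be written directly in PHP (the language the question is tagged with). The sketch below fetches pages with file_get_contents, follows same-host links found via DOMDocument, and records which pages contain the search word. The function and variable names are mine, relative links such as "page.html" are skipped for brevity, and a real crawler would also need robots.txt handling and politeness delays (see the replies below).

<?php
// Bare-bones single-host spider sketch. $startUrl must include the scheme,
// e.g. "http://example.com/". Names and limits here are illustrative only.
function spiderAndSearch($startUrl, $word, $maxPages = 50)
{
    $parts = parse_url($startUrl);
    $host  = $parts['host'];
    $base  = $parts['scheme'] . '://' . $host;

    $queue   = array($startUrl);
    $visited = array();
    $matches = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);   // needs allow_url_fopen; cURL works too
        if ($html === false) {
            continue;
        }

        // Record pages whose raw HTML contains the word (case-insensitive).
        if (stripos($html, $word) !== false) {
            $matches[] = $url;
        }

        // Queue links that stay on the same host.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);             // suppress warnings on sloppy markup
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if (strpos($href, '/') === 0) {
                $href = $base . $href;      // site-relative link
            }
            if (parse_url($href, PHP_URL_HOST) === $host) {
                $queue[] = $href;
            }
        }
    }

    return array('pages_crawled'   => count($visited),
                 'pages_with_word' => $matches);
}

// Example: print_r(spiderAndSearch('http://example.com/', 'linux'));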

Discuss This Question: 3 Replies

 
  • Bobkberg
    John Brandt already provided the basic answer. The only other caveat I'd offer is that a spider utility will only show you the web pages that are linked from the index (or another known) page. Many sites have "landing" pages designed specifically for advertising click-through that are never referenced by anything but ads, and some pages are meant to remain private. The other place to look is the domain's robots.txt file, since (if present) it specifies which files, pages, and directories a spider should or should not consider fair game for indexing, and may give you the information you need; a sketch of this appears after these replies. Bob
  • Riverwind
    It depends on how you design your web application. At the server level you can use spidering software to monitor usage, but I think the most suitable approach is to build the controls and logging into the web application itself, i.e., control at the application level. That is quite tough to do, however.
  • NigelMcFarlane
    If your web application is to be a good global citizen, you shouldn't scan a site for pages when its robots.txt tells you not to (see the robots.txt sketch after these replies). An excellent client-side web spidering library can be found at www.bclary.com. However, you're pretty stuck on the client side because of security: you can't script into a web page that belongs to a foreign site (the "Same Origin Policy"), so you can't follow all the links in the other site's loaded pages to discover the size of the site. That means the server-side code that accepts the form submission has to do the crawling. You might as well use the Google API (it's SOAP, I think) and ask Google to do the word search for you. Then you don't need to count or scan pages at all - just use the Google results. - N.
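Pulling together the robots.txt advice from the replies above, here is one way the crawl could honour a site's robots.txt before fetching a page. This is only a sketch: the helper names are mine, and it handles just the common "User-agent: *" block with plain "Disallow:" prefixes.

<?php
// Collect the Disallow prefixes that apply to all crawlers ("User-agent: *").
function disallowedPaths($host)
{
    $paths = array();
    $txt   = @file_get_contents('http://' . $host . '/robots.txt');
    if ($txt === false) {
        return $paths;                       // no robots.txt: nothing is disallowed
    }

    $applies = false;
    foreach (preg_split('/\r\n|\r|\n/', $txt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $path = trim(substr($line, 9));
            if ($path !== '') {
                $paths[] = $path;
            }
        }
    }
    return $paths;
}

// True if the URL's path does not fall under any Disallow prefix.
function mayFetch($url, $disallowed)
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === null || $path === '') {
        $path = '/';
    }
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

// Example: inside the spider loop, skip $url unless mayFetch($url, $rules) is true,
// where $rules = disallowedPaths('example.com');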
