The VBScript Network and Systems Administrator's Cafe:

wget

Oct 31 2008   6:17AM GMT

Using Internet Explorer objects to scrape links from web pages.



Posted by: Jerry Lees
web tools, web sites, wget, Web Pages, InternetExplorer.Application

Recently, I needed to write a tool that would scrape the links from a page. To accomplish this, I used the Internet Explorer object “InternetExplorer.Application”. We’ll explain it a bit more in a later entry, but for now, take a look at the code below:

' Page whose links will be listed
URL = "http://itknowledgeexchange.techtarget.com/itblogs/"

With CreateObject("InternetExplorer.Application")
  .Navigate URL
  ' Wait until the page has finished loading (READYSTATE_COMPLETE = 4)
  Do Until .ReadyState = 4
    WScript.Sleep 10
  Loop
  ' Echo each link's target address followed by its visible text
  For Each link In .document.links
    WScript.Echo link.href, link.innerText
  Next

' Uncomment the three lines below to scrape references to images
'  For Each pix In .document.images
'  WScript.Echo pix.src
'  Next

  .Quit
End With
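
For comparison, here is a rough sketch of the same idea without a browser, written in Python rather than VBScript. It is not the script above, and it is a different technique: it parses the raw HTML text with the standard library's html.parser, so it will not see links that a page adds with JavaScript, and it only looks at <a> tags. The names LinkCollector and extract_links are made up for this illustration, and the sample HTML string is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute_url, link_text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative addresses against the page address
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (url, text) pairs found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Invented sample markup, used only to show the call
    sample = '<p><a href="/itblogs/">Blogs</a> <a href="http://www.gamersigs.net">Sigs</a></p>'
    for url, text in extract_links(sample, "http://itknowledgeexchange.techtarget.com/"):
        print(url, text)
```

The HTML passed to extract_links could come from a file saved earlier, which keeps the parsing step separate from the download step.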

Aug 13 2008   2:29PM GMT

Essential tools: Wget, a command line tool to retrieve web pages



Posted by: Jerry Lees
web tools, free tools, Systems administrator tools, Toolkit, essential tools, wget, http tools, windows tools

There is nothing more annoying than having a web server or site down while IE (or Firefox, for that matter) becomes dog slow. Sometimes these browsers actually get in the way of troubleshooting by masking the error page the server sends back; IE’s “friendly” HTTP error messages are one example. When it comes right down to fixing the problem, sometimes you need to retrieve just the HTML code a particular web page sends, simply for inspection or analysis. That is where our next essential tool comes in!

Wget is a small (~325K) command line utility that lets you quickly download a file over HTTP, HTTPS, or FTP and save it locally, so you can open it in a text editor, keep a copy in an alternate location, or compare it with what a specific browser renders after download. Wget for Windows can be downloaded here. It’s a powerful tool, and covering all the options in one posting isn’t possible, so let’s start off with a little syntax to get you rolling:

In its simplest form, you can download a specific page by passing its full URL, as shown below:

wget http://www.gamersigs.net

Alternatively, you can download a site and all its linked items recursively to a specific number of levels. This is useful to archive a site or to simply grab files that a page uses but doesn’t show as clickable links, such as Cascading Style Sheets (CSS). The syntax below will recursively get 2 levels of www.msn.com and automatically create a directory called www.msn.com in the current directory.

wget -r --level=2 http://www.msn.com

If the page links to an HTTPS page, wget will automatically try to negotiate an SSL connection. You can optionally specify the SSL protocol to use by adding --secure-protocol=PR, where PR is either auto, SSLv2, SSLv3, or TLSv1. This is especially helpful in testing and ensuring your servers do not respond to the weaker SSLv2 protocol.

If you deal with websites as a part of your Systems Administration duties, or if you’re just interested in them as a side project at the office, I’m sure you’ll add this tool to your essential tools.

Know of a tool that you think is essential? Post a comment here and if I don’t already have it in my tool belt, I’ll add it and give it a shot. If it makes the grade, I’ll add it to the list of tools to review. The only criteria are:

  1. The tool must be free, or inexpensive with a “Per User” or “site” type license. (No pay-per-installation licenses, please.)
  2. The tool (or its installation file) must be small enough to fit on a 256 MB flash drive for portability.
  3. Command line run time options are beneficial, but not required.
  4. If it has ads… it needs to be truly INVALUABLE.
  5. It should make the user’s job easier by gathering information or performing a task that a typical Network or Systems Administrator would perform.

Enjoy!