Wednesday, February 08, 2006

Web Page Harvester Program: Some Preliminary Thoughts

Coming in the very near future is a program to "harvest" information (links, email addresses, and other information) from Web pages.

Actually, I already had such a program written and ready to upload, but I decided to rewrite it from scratch: although it works perfectly on some sites, it fails to get all the information on others. (And I'm not talking about those sites that have chosen - and quite properly so - to "obfuscate" email addresses to protect the addressees from spam.)

If you find that Harvester can "harvest" email addresses on your own Web site, you'll be glad to know that the next program will show you how to protect yourself against most "spam robots" or "spambots." I'm not giving away any "secrets" in either program: the information is freely available on the Internet if you know where and how to look for it, but use of the techniques in the latter program can create an additional barrier that most spammers are too lazy to bother about.

I don't know whether it's true or not, but I read somewhere recently (in defense of XHTML) that half the code in most browser programs is devoted to dealing with sloppily written HTML. From my own experience I don't consider that an unlikely figure.

The Harvester program I wrote worked fine with most of the sites on which I tried it, but there were some sites where it either missed or messed up the information on the links or (rarely) the email addresses (except, of course, for those pages where the email addresses were "obfuscated" on purpose). In some cases, it was not a matter of improperly written HTML, but simply complex HTML that contained some items my program had not made provision for.

For example, one easy fix was the proper handling of a Spanish "tilde n" (ñ) or a French "c cedilla" (ç). But I found other problems that were not as easy to fix, and all of the "band-aids" I was adding were turning the program into hard-to-follow spaghetti code. So I decided that - rather than upload that program - it would be better to rewrite it from scratch.
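As an illustration only: one place where characters like these commonly trip up a harvester is when the page source spells them as entity references rather than as the characters themselves. The post does not say which form caused the trouble, so this is an assumption, and the sketch is in Python (not REALbasic) using Python's standard `html` module.

```python
import html

# Entity references as they might appear in a page's source.
samples = ["Espa&ntilde;a", "gar&ccedil;on", "Espa&#241;a"]

# html.unescape turns named and numeric references into the characters themselves.
decoded = [html.unescape(s) for s in samples]
print(decoded)  # ['España', 'garçon', 'España']
```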

What sort of things might cause a problem for a "harvester" program or at least need to be taken into account? (I'm sort of thinking out loud in today's blog entry.) Immediately, I can think of five.

One is the fact that a link or an email address does not have to appear entirely on one line. It may be split across a couple of lines (or even three, at least one instance of which I found and had to make an adjustment for). Perhaps the easiest way to deal with this is to replace any end-of-line markers with a single blank space. That may not always work, however, if a line ends with a hyphen ("-") immediately before the line break. (I'll have to check on that.)
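The end-of-line replacement described above might look something like the following. This is a minimal sketch in Python rather than REALbasic, and the function name `join_lines` is made up for the illustration.

```python
import re

def join_lines(html: str) -> str:
    """Replace each end-of-line marker (CRLF, CR, or LF) with one space."""
    # CRLF is listed first so a Windows line ending becomes one space, not two.
    return re.sub(r"\r\n|\r|\n", " ", html)

page = '<a\nhref="page.html">Next\r\npage</a>'
print(join_lines(page))  # <a href="page.html">Next page</a>
```

The hyphen worry shows up right away with this approach: an address broken after a hyphen, such as "my-" on one line and "name@example.com" on the next, comes out as "my- name@example.com", with an unwanted space in the middle.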

Another is the fact that, in general, extra spaces do not matter in HTML (but they DO matter in a "harvester" program). There are, however, cases where extra spaces DO matter in HTML (e.g., between "<pre>" and "</pre>" or within quotation marks). So a Web page "harvester" has to take such details into account. In addition, there may (or may not) be some cases where HTML doesn't care whether there is a space or no space at all. (That's something else I'll have to check on, either by consulting the reference books or by experimenting myself.)

The obvious basic solution is for Harvester to remove any extra spaces, replacing groups of blank spaces with single blank spaces, BUT it has to be careful where to do this. (It cannot do it, as I said, with groups of spaces within quotation marks or within a "<pre>" block.)
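One way to collapse spaces while leaving the sensitive areas alone is to split the page into "protected" and "unprotected" pieces first. The sketch below is in Python rather than REALbasic, the names `PROTECTED` and `collapse_spaces` are invented for the example, and it makes simplifying assumptions: only double quotes are treated as quotation marks, and any quoted text counts as protected, whether or not it sits inside a tag.

```python
import re

# Spans that must be left alone: <pre> blocks and double-quoted strings.
PROTECTED = re.compile(r'(<pre\b.*?</pre>|"[^"]*")', re.IGNORECASE | re.DOTALL)

def collapse_spaces(html: str) -> str:
    """Squeeze runs of spaces/tabs to one space outside protected spans."""
    parts = PROTECTED.split(html)
    # Because the pattern has one capture group, the protected spans land
    # at the odd indexes; only the even-indexed pieces are squeezed.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"[ \t]+", " ", parts[i])
    return "".join(parts)

sample = 'a   b <pre>x   y</pre>  c "p   q"  d'
print(collapse_spaces(sample))  # a b <pre>x   y</pre> c "p   q" d
```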

A third consideration is the matter of capitalization, something that UNIX and Linux servers care about (but Windows NT servers do not). Fortunately, REALbasic makes this fairly simple to deal with (or at least, in general, simpler than in Visual Basic), because in most REALbasic string comparisons case is ignored (though that is not always true), and most (but not all) of the time we want to ignore the case (except in filenames). Harvester has to do string searches and comparisons to perform its work, so the matter of case can be important.
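The split between "ignore case when looking for tags" and "respect case in filenames" can be made explicit in code. This sketch is in Python rather than REALbasic, where comparisons are case-sensitive unless you ask otherwise; the function names `find_tag` and `same_file` are hypothetical.

```python
import re

def find_tag(html: str, tag: str) -> int:
    """Return the position of tag in html, ignoring case, or -1 if absent."""
    match = re.search(re.escape(tag), html, re.IGNORECASE)
    return match.start() if match else -1

def same_file(name_a: str, name_b: str) -> bool:
    """Compare filenames exactly, since a UNIX server treats case as significant."""
    return name_a == name_b

print(find_tag('<BODY><A HREF="x">', "<a href"))  # 6
print(same_file("Index.html", "index.html"))      # False
```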

A fourth thing you need to take into consideration is the matter of nested tags or the matter of attributes or values contained in quotation marks (or not) within tags. For example, in HTML such nesting often occurs within pairs of anchor tags ("<a [whatever]>" and "</a>"). Sometimes you may want to save what is inside the anchor pair (such as the information in the image tag, "<img src=[whatever]>").
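To make the nesting issue concrete, here is a small sketch that leans on a ready-made parser instead of hand-written string searches. It is in Python rather than REALbasic and uses Python's standard `html.parser` module, which copes with quoted and unquoted attribute values and with upper- or lowercase tag names. The class name `LinkCollector` and the dictionary layout are assumptions made for the example, not a description of how Harvester works.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect each anchor's href, its text, and any images nested inside it."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished links, one dict per anchor
        self._current = None  # the link being built while inside <a>...</a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)   # tag and attribute names arrive lowercased
        if tag == "a" and "href" in attrs:
            self._current = {"href": attrs["href"], "text": "", "images": []}
        elif tag == "img" and self._current is not None:
            self._current["images"].append(attrs.get("src"))

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

collector = LinkCollector()
collector.feed('<A HREF=page.html><b>Home</b><IMG SRC="pic.gif"></A>')
collector.close()
print(collector.links)
```

A link whose `images` list is non-empty is one built on an image (a "thumbnail" link, say), which makes it easy to keep or skip such links later.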

Fifthly, you may want to deal with links to certain types of files (e.g., ".htm" and ".html" files) in a different way than you handle links to other types of files (e.g., ".jpg", ".mpg", or ".mp3" files). For reasons that I will not go into here, the first Harvester only "harvested" (and probably its successor will only "harvest") the information relating to "text links" (and thus will not include any links to music files or links based on "thumbnails" among the link information that it harvests).
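Sorting links by file type can be done from the extension at the end of the URL's path. The following is a minimal sketch in Python rather than REALbasic; the function name `link_kind`, the category labels, and the two extension sets (taken from the examples above) are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlparse

PAGE_EXTENSIONS = {".htm", ".html"}
MEDIA_EXTENSIONS = {".jpg", ".mpg", ".mp3"}

def link_kind(url: str) -> str:
    """Classify a link as 'page', 'media', or 'other' by its file extension."""
    # urlparse drops any ?query or #fragment so they cannot hide the extension.
    path = urlparse(url).path
    # The extension is lowercased only for this test; the link itself is untouched.
    extension = posixpath.splitext(path)[1].lower()
    if extension in PAGE_EXTENSIONS:
        return "page"
    if extension in MEDIA_EXTENSIONS:
        return "media"
    return "other"

print(link_kind("http://example.com/a/Index.HTML?x=1#top"))  # page
print(link_kind("song.mp3"))                                 # media
```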

Have I left out anything important among the preliminary considerations? If so, please feel free to send me a private email or blog comments with your thoughts. Unlike C (I'm told), REALbasic is rich in built-in string functions, so we'll be making use of that fact in Harvester. The program will also show how to download the contents of an HTML (or XHTML) file from the Internet. (We can't "harvest" information from a Web page on the Internet unless we are able to download that Web page from the Internet!)
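For readers curious what the download step involves, here is a rough sketch. It is in Python rather than REALbasic and uses Python's standard `urllib.request`; the function name `fetch_page`, the ten-second timeout, and the UTF-8 fallback are all assumptions made for the example.

```python
from urllib.request import urlopen

def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its source as text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares, if any; otherwise guess UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    # Bytes that cannot be decoded become U+FFFD instead of raising an error.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_page("http://example.com/")` (a placeholder address) would hand back the page source as one string, ready for the line-joining and space-collapsing steps discussed earlier.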

Keep tuned....

Barry Traver

Home Page for This Blog:

Programs and Files Discussed in the Blog:


At 3:16 PM, Blogger Steve said...

This sounds interesting. So do you have a follow-up planned anytime soon? I miss seeing your articles, and I hope you keep posting.


At 12:41 PM, Blogger Barry Traver said...

Steve, You're the first person I've heard from since I posted the item for February 16, 2006, so I had pretty much decided that no one was reading the blog, so why post? Thanks for taking the time to post a comment and letting me know otherwise. As a result of your note, I will plan on resuming soon.

I've been continuing to write RB programs; I just haven't been posting them (because of the apparent lack of interest). There's nothing spectacular about any of the ones I've done recently (one of them creates off-line the URL for a Yahoo map for a particular address, for example), but I'll try to start posting again (and uploading the related RB programs).


