Thursday, February 16, 2006

Some Favorite, Frequently-Used Functions

As you know, I usually add something new to this blog every other day on average. Why the delay this time? The main reason is that I have Verizon high-speed DSL (or, rather, that I was without Verizon DSL for five days).

I phoned Verizon on Thursday to let them know that I was unable to connect to the Internet. I was told that I would have service restored "in 48 to 72 hours." How does that become five days? Simple. Verizon does not work on the weekends. (They presumably operate following the non-Biblical injunction, "Five days shalt thou labor...." It's not enough for them to take Sunday off, they apparently need to take Saturday off as well.)

So I was without Verizon DSL for five days. I was given an "open ticket," but whenever I called up to check on it, all I got was a recorded message telling me that they were "still working on my problem" and that "no further information was available." That is, that's what I got when I got a response other than "we're experiencing a greater-than-usual call volume at this time. Please hang up and try your call again later." (That's after you go through the voice-driven automatic menu system, where you're told, "I don't understand. Please answer 'Yes' or 'No.'" either because you don't respond because it's not really a "yes/no" type question or because you do respond but Verizon's voice recognition software doesn't "understand.")

I did finally get through to a genuine human being five days later, a technical support person who walked me through the procedure to get myself connected to the Internet again. (No explanation was given as to why this could not have been done five days earlier, and the implication was given that I must have pressed the reset button on my modem, which I know is something that I did NOT do.) Since I get hundreds of emails a day, needless to say, I was not very happy about being off-line for five days, but perhaps my experience with Verizon explains why I have not been able to add a new entry to this blog until now.

I hope to have a new program or two available to you by tomorrow, but for now let's briefly consider the topic of "a few functions I've added to REALbasic, since I use them so often." (If you have such a list, I'd be very interested to see it.)

I think all of us probably have such routines, perhaps gathered together in a module that we include in our Projects. (I'm not quite that organized yet, but there are some things I find myself regularly adding to Projects, so I should probably get around to putting them into a module to make them more readily accessible.)

What are some of my useful add-on subs and functions? Here's one:

Function Mid1 (M As String, Pos1 As Integer, Pos2 As Integer) As String
   Return MidB (M, Pos1, Pos2 - Pos1 + 1)
End Function

I often prefer this to the regular Mid or MidB, because when doing a string extraction I usually have in mind the starting position and the end position, not the starting position and the number of characters. For those people who don't have to even think about "Pos2 - Pos1 + 1," such a function is not needed, but I personally find it helpful.
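Here is a small usage sketch. The sample string is just something made up for illustration, and it is plain ASCII, so byte positions and character positions are the same:

```realbasic
Dim S As String
S = "REALbasic"
' Bytes 5 through 9, i.e. the same as MidB (S, 5, 5)
MsgBox Mid1 (S, 5, 9)   ' shows "basic"
```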

When I want to find the first occurrence of StringA or StringB within StringC (and I want whichever comes first, whether it be StringA or StringB), I use this function, assuming that A = InStrB(StringC, StringA) and B = InStrB(StringC, StringB):

Function MinNonZero (A As Integer, B As Integer) As Integer
   If A <> 0 And B = 0 Then
      Return A
   ElseIf A = 0 And B <> 0 Then
      Return B
   ElseIf A <> 0 And B <> 0 Then
      Return Min (A, B)
   ElseIf A = 0 And B = 0 Then
      Return 0
   End If
End Function
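As a rough sketch of how MinNonZero gets called (the sample string and the variable name First are just assumptions for the sake of the example):

```realbasic
Dim StringC As String
Dim A As Integer
Dim B As Integer
Dim First As Integer

StringC = "name=value; other"
A = InStrB (StringC, "=")   ' 5
B = InStrB (StringC, ";")   ' 11
' Whichever delimiter comes first; 0 only if neither one is found
First = MinNonZero (A, B)   ' 5
```

The point of the function is that a plain Min (A, B) would return 0 whenever one of the two strings is missing, even though the other one was found.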

I'm sure that the following could be written much more efficiently, but sometimes I want to create a longer string based on repetitions of a shorter string. Then I use this:

Function Rpt (A As String, B As Integer) As String
   Dim I As Integer
   Dim M As String
   M = ""
   For I = 1 To B
      M = M + A
   Next I
   Return M
End Function
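One way to cut down on the number of concatenations is to double the piece being added each time through the loop. This is only a sketch (the name RptFast is my own, and I have not timed it against Rpt), but it should return the same result as Rpt for any B, including an empty string when B is zero or negative:

```realbasic
Function RptFast (A As String, B As Integer) As String
   Dim Result As String
   Dim Chunk As String
   Dim N As Integer
   Result = ""
   Chunk = A
   N = B
   While N > 0
      ' If the lowest bit of N is set, this power-of-two chunk is needed
      If N Mod 2 = 1 Then Result = Result + Chunk
      Chunk = Chunk + Chunk   ' A, AA, AAAA, ...
      N = N \ 2               ' integer division
   Wend
   Return Result
End Function
```

For B = 1000 the loop runs about ten times instead of a thousand.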

Or, if I want to count how many instances there are of a shorter string within a larger string, I use this:

Function HowMany (A As String, WhatToCount As String) As Integer
   Dim Counter As Integer
   Counter = CountFields (A, WhatToCount) - 1
   If Counter = - 1 Then Counter = 0
   Return Counter
End Function

Or, when I want to find the position of not the first occurrence of a shorter string within a longer string, but the last occurrence, I use this:

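Function InStrBRev (Pos1 As Integer, String1 As String, String2 As String) As Integer
   ' Search backward for String2 in String1, starting just before byte position Pos1.
   ' Returns the byte position of the match, or 0 if there is none.
   Dim I As Integer
   I = Pos1
   Do
      I = I - 1
      If I < 1 Then
         I = 0
         Exit
      End If
      ' LenB rather than Len, since MidB counts bytes
      If MidB (String1, I, LenB (String2) ) = String2 Then Exit
   Loop
   Return I
End Function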
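A usage sketch: since the search begins one position before Pos1, pass the length of the string plus one to search the whole thing. (The file name "index.html" below is just a made-up example.)

```realbasic
Dim URL As String
Dim P As Integer
URL = "http://traver.org/traverrb/index.html"
' Start past the end so that the final byte is included in the search
P = InStrBRev (LenB (URL) + 1, URL, "/")   ' 27, the last slash
MsgBox MidB (URL, P + 1)                   ' shows "index.html"
```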

One function I use surprisingly often (perhaps partly because of working with CodeHelper and related programs) is this one:

Function OutsideQuotes (s1 As String, Pos1 As Integer) As Boolean
   Dim A As String
   Dim Counter As Integer
   A = LeftB (s1, Pos1 - 1)
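   ' Q is not defined in this listing; it is assumed to be a constant
   ' holding the double-quote character, Chr (34)
   Counter = HowMany (A, Q)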
   If Counter Mod 2 = 0 Then
      Return True
   Else
      Return False
   End If
End Function

The routine determines whether the character at a certain position within a string is "inside quotation marks" or "outside quotation marks."
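Here is a sketch of how it might be used, assuming that Q is a constant holding the double-quote character, Chr (34). The sample line of code and the variable names are made up for the example:

```realbasic
Dim Src As String
Dim P1 As Integer
Dim P2 As Integer

' Src holds:  Print "a + b" + c
Src = "Print " + Chr (34) + "a + b" + Chr (34) + " + c"

P1 = InStrB (Src, "+")           ' 10, the plus sign inside the quotation marks
P2 = InStrB (P1 + 1, Src, "+")   ' 15, the plus sign after the closing quotation mark

' One quotation mark comes before position 10 (odd), two before position 15 (even)
If Not OutsideQuotes (Src, P1) Then MsgBox "First + is inside quotes"
If OutsideQuotes (Src, P2) Then MsgBox "Second + is outside quotes"
```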

Many of these routines, you may have noticed, involve string handling, and more efficient versions of some of these can be found in Joe Strout's StringUtils module at http://www.strout.net/info/coding/rb/intro.html
"This is a public-domain module of string functions, including ways to reverse a string, remove or delete certain characters, convert a string into hex, handle fields containing quotes, measure the similarity of two strings, and more.

"The archive contains an RB project file which runs the unit tests on the various StringUtils functions. Just drag the StringUtils module out of this project, then drag it into your own projects, and enjoy."

I wrote most of mine before that package was created, so mine do not take advantage of the more efficient procedures to be found there.

Barry Traver



Home Page for This Blog: http://traverrb.blogspot.com/

Programs and Files Discussed in the Blog: http://traver.org/traverrb/

Wednesday, February 08, 2006

Web Page Harvester Program: Some Preliminary Thoughts

Coming in the very near future is a program to "harvest" information (links, email addresses, and other information) from Web pages.

Actually, I already had a program written like that ready to upload, but I decided to rewrite it from scratch, since - although it works perfectly on some sites - it fails to get all the information on some other sites. (And I'm not talking about those sites that have chosen - and quite properly so - to "obfuscate" email addresses to protect the addressees from spam mail.)

If you find that Harvester can "harvest" email addresses on your own Web site, you'll be glad to know that the next program will show you how to protect yourself against most "spam robots" or "spambots." I'm not giving away any "secrets" in either program: the information is freely available on the Internet if you know where and how to look for it, but use of the techniques in the latter program can create an additional barrier that most spammers are too lazy to bother about.

I don't know whether it's true or not, but I read somewhere recently (in defense of XHTML) that half the code in most browser programs is devoted to dealing with sloppily written HTML. From my own experience I don't consider that an unlikely figure.

The Harvester program I wrote worked fine with most of the sites with which I tried it, but there were some sites where it either missed or messed up the information on the links or (rarely) the email addresses (except, of course, for those pages where the email addresses were "obfuscated" on purpose). In some cases, it was not a matter of improperly written HTML, but simply complex HTML which contained some items my program had not made provision for.

For example, an easy fix was one for the proper handling of a Spanish "n tilde" or a French "c cedilla." But I found other problems that were not as easy to fix, and all of the "band-aids" I was adding were turning the program into hard-to-follow spaghetti code. So I decided that - rather than upload that program - it would be better to rewrite it from scratch.

What sort of things might cause a problem for a "harvester" program or at least need to be taken into account? (I'm sort of thinking out loud in today's blog entry.) Immediately, I can think of five.

One is the fact that a link or an email address does not have to appear entirely on one line. It may be split across a couple of lines (or even three, at least one instance of which I found and had to make adjustment for). Perhaps the easiest way to deal with this is to replace any end-of-line markers with a single blank space. That may not always work, however, if the line ends with a hyphen ("-") immediately before the line break. (I'll have to check on that.)
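A minimal sketch of that replacement, assuming the page has already been read into a string (the function name JoinLines is just something I made up). It handles the three usual kinds of line endings and does nothing special about the hyphen case:

```realbasic
Function JoinLines (Html As String) As String
   Dim S As String
   ' Do CR+LF first, so that a Windows line ending becomes one space, not two
   S = ReplaceAll (Html, Chr (13) + Chr (10), " ")
   S = ReplaceAll (S, Chr (13), " ")   ' a lone CR (classic Mac)
   S = ReplaceAll (S, Chr (10), " ")   ' a lone LF (UNIX)
   Return S
End Function
```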

Another is the fact that, in general, extra spaces do not matter in HTML (but they DO matter in a "harvester" program). There are, however, cases where extra spaces DO matter in HTML (e.g., between "<pre>" and "</pre>" or within quotation marks). So a Web page "harvester" has to take into account such details. In addition, there may (or may not) be some cases where HTML doesn't care whether there is a space or no space at all. (That's something else I'll have to check on, either by consulting the reference books or by experimenting myself.)

The obvious basic solution is for Harvester to remove any extra spaces, replacing groups of blank spaces with single blank spaces, BUT it has to be careful where to do this. (It cannot do it, as I said, with groups of spaces within quotation marks or within a "<pre>" block.)
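Here is a rough sketch of the idea, not the actual Harvester code. The function name is made up, and it only takes care of the quotation-mark case: it makes no attempt to recognize a "<pre>" block, and it assumes that quotation marks come in matched pairs.

```realbasic
Function SqueezeOutsideQuotes (S As String) As String
   Dim Result As String
   Dim C As String
   Dim PrevC As String
   Dim InQuotes As Boolean
   Dim I As Integer
   For I = 1 To Len (S)
      C = Mid (S, I, 1)
      ' Each quotation mark flips us between "inside" and "outside"
      If C = Chr (34) Then InQuotes = Not InQuotes
      ' Drop a space only when it follows another space outside quotation marks
      If InQuotes Or C <> " " Or PrevC <> " " Then Result = Result + C
      PrevC = C
   Next
   Return Result
End Function
```

Building the result one character at a time is slow for a large page, so this is meant to show the logic rather than to be fast.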

A third consideration is the matter of capitalization, something that UNIX and Linux servers care about (but not Windows NT). Fortunately, REALbasic makes this fairly simple to handle (or at least, in general, simpler than in Visual Basic), because in most REALbasic string comparisons case is ignored (though that is not always true), and most (but not all) of the time we want to ignore case (except in filenames). Harvester has to do string searches and comparisons to perform its work, so the matter of case can be important.
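To illustrate the difference, here is a small sketch as I understand the built-in behavior: the "=" operator and InStr ignore case, while StrComp with a mode of 0 and InStrB compare byte for byte. (The two file names are made up.)

```realbasic
Dim A As String
Dim B As String
A = "Index.html"
B = "index.html"

' "=" ignores case, so these count as equal
If A = B Then MsgBox "Same, ignoring case"

' StrComp with mode 0 is a binary comparison; it returns 0 only for an exact match
If StrComp (A, B, 0) <> 0 Then MsgBox "Different, respecting case"

' InStr ignores case; InStrB does not
MsgBox Str (InStr (A, "index"))    ' 1
MsgBox Str (InStrB (A, "index"))   ' 0
```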

A fourth thing you need to take into consideration is the matter of nested tags or the matter of attributes or values contained in quotation marks (or not) within tags. For example, in HTML other tags are often nested between pairs of anchor tags ("<a [whatever]>" and "</a>"). Sometimes you may want to save what is inside the anchor tags (such as the information in an image tag, "<img src=[whatever]>").

Fifthly, you may want to deal with links to certain types of files (e.g., ".htm" and ".html" files) in a different way than you handle links to other types of files (e.g., ".jpg", ".mpg", or ".mp3" files). For reasons that I will not go into here, the first Harvester only "harvested" (and probably its successor will only "harvest") the information relating to "text links" (and thus not include any links to music files or links based on "thumbnails" among the link information that it harvests).
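A simple sketch of sorting links by file type might look like the following. The function name is made up, and it looks only at the end of the URL, so it would be fooled by a link that has a query string or an anchor ("#") tacked on after the file name:

```realbasic
Function IsPageLink (URL As String) As Boolean
   Dim U As String
   ' Lowercase first, so that ".HTM" and ".Html" are caught too
   U = Lowercase (URL)
   Return Right (U, 4) = ".htm" Or Right (U, 5) = ".html"
End Function
```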

Have I left out anything important among the preliminary considerations? If so, please feel free to send me a private email or blog comments with your thoughts. Unlike C (I'm told), REALbasic is rich in built-in string functions, so we'll be making use of that fact in Harvester. The program will also show how to download the contents of an HTML (or XHTML) file from the Internet. (We can't "harvest" information from a Web page on the Internet unless we are able to download that Web page from the Internet!)

Keep tuned....

Barry Traver


