The iGMonkey Example Application

The Application Innocuous Query Censored Query

Google.cn, a local version of the Google search engine for China, launched in January 2006 and sparked a flurry of arguments over the relative ‘evilness’ of this act. The crux of the matter was that the Chinese government refused to allow Google.cn to operate unless it agreed to censor specific search results at the government's request.

Unfortunately, informed debate is not possible without good information, and, at the time of this writing, Google and the Chinese government are the only parties that know what results Google.cn censors. Happily, the Google Homepage API makes it possible to inform this debate using Google's own technology!

The example application is a mashup that provides a simple way for anyone to query both Google.com and Google.cn simultaneously and compare the results. The Homepage API makes it possible. iGMonkey makes it easy.

Setup and Preparation

Click Add to Google to add the iGMonkey example application to your Google Personalized Homepage. You may also add it manually by following these steps:

  1. Point your browser at the Google Personalized Home Page.
  2. Click the Add Content link in the upper-left corner of the browser.
  3. In the box labeled, Search by topic or feed URL, enter
    http://www.csua.berkeley.edu/~dans/google/mashups/gvg/google.com_v_google.cn.xml
    and click Go.
  4. Your browser will display a warning message. Read it, and click OK.

Try out the iGMonkey example application that now appears on your personalized homepage.

What iGMonkey Examples Share

All the iGMonkey examples take advantage of its document retrieval and rendering pipeline. Consequently, all examples share essentially the same code to create, configure, and invoke an iGMonkey object. Additionally, all the examples use iGMonkey filters. Currently, two types of filters exist, and all filters of the same type share certain qualities.

iGMonkey Creation, Configuration, and Invocation Code

 1 function submit__MODULE_ID__ (query) {
 2     var monkey =
 3         new IGMonkey([ "http://www.google.com/search?hl=en&q=" + _esc(query) +
 4                        "&btnG=Google+Search",
 5                        "http://www.google.cn/search?hl=zh-CN&q=" + _esc(query) +
 6                        "&btnG=Google+搜索&meta=" ],
 7                      [ "Google.com", "Google.cn" ],
 8                      [ _gel("fwResDiv"+__MODULE_ID__),
 9                        _gel("cnResDiv"+__MODULE_ID__) ]);
10 
11     monkey.requestFilters.push(qFltrBustCacheDay);
12     monkey.resultFilters.push(sFltrGoogleResultsapproach__MODULE_ID__);
13     // monkey.resultFilters.push(sFltrViewHTML);
14 
15     monkey.exec();
16 }

iGMonkey constructor call, configuration, and invocation

Lines 2-9 create a new iGMonkey object by calling its constructor. The simplest form of iGMonkey's constructor takes three arrays as arguments:

3-6
the URLs to query
Note: _esc is a Homepage API wrapper around escape and encodeURIComponent.
7
the human-readable names describing each query source
8,9
the DOM Nodes where iGMonkey will output its rendered results. iGMonkey refers to each Node as a canvas.
Note: _gel is a Homepage API wrapper around getElementById.

If you are only working with a single document, the constructor call can be simplified further. In that case, you don't need to bundle your arguments into arrays; iGMonkey will automagically do the right thing. Remember that this only works with a single document.
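One way such argument handling could work is to wrap any non-array argument in a one-element array before the rest of the constructor runs. The following is only a sketch of that idea: normalizeArg is a hypothetical helper invented for illustration, not part of iGMonkey's documented interface.

```javascript
// Hypothetical helper: wrap a lone value in an array, leave arrays untouched.
function normalizeArg(arg) {
    return (arg instanceof Array) ? arg : [arg];
}

// A single-document call and a multi-document call end up in the same shape.
var single = normalizeArg("http://www.google.com/search?q=foo");
var multiple = normalizeArg(["http://www.google.com/search?q=foo",
                             "http://www.google.cn/search?q=foo"]);
```

After normalization, the rest of the code only ever has to deal with arrays, whichever calling style was used.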

Lines 11-13 add filters to the iGMonkey object's rendering pipeline:

11
qFltrBustCacheDay is a filter included with iGMonkey that adds a parameter with the current date as its value to the request string. This works around a bug where the Google Content Proxy caches pages too aggressively.
12
sFltrGoogleResultsapproach__MODULE_ID__ is the main workhorse filter supplied by each example. It is named sFltrGoogleResultsDOM__MODULE_ID__ in the DOM example, and sFltrGoogleResultsRegex__MODULE_ID__ in the regular expression example.
13
sFltrViewHTML is an iGMonkey filter that lets you view the HTML source of content at any stage in the rendering pipeline. This is very helpful when debugging. Try copying one of the examples and uncommenting this line to see it in action.

Line 15 is where the magic happens. Invoking the iGMonkey object's exec method sets the following events in motion:

  1. iGMonkey passes each query string, i.e. the first argument to the constructor, through the requestFilters chain.
  2. Next, iGMonkey takes each filtered query string, and fetches the document it refers to.
  3. Then iGMonkey passes each retrieved document through the resultFilters chain.
  4. Finally, iGMonkey displays each filtered document on its respective canvas, i.e. the third argument to the constructor.
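The four steps above can be sketched as a small loop. This is not iGMonkey's source: runPipeline and fetchDoc are hypothetical names, the fetch is a synchronous stub (the real retrieval presumably happens asynchronously over the network), and plain objects with an innerHTML property stand in for DOM Nodes.

```javascript
// Hypothetical stand-in for iGMonkey's pipeline; fetchDoc replaces the real
// network retrieval so the sketch can run anywhere.
function runPipeline(requests, canvases, requestFilters, resultFilters, fetchDoc) {
    for (var i = 0; i < requests.length; i++) {
        var request = requests[i], canvas = canvases[i], j;

        // Step 1: pass the request string through the request filter chain.
        for (j = 0; j < requestFilters.length; j++) {
            request = requestFilters[j](request);
        }
        // Step 2: fetch the document that the filtered request refers to.
        canvas.innerHTML = fetchDoc(request);
        // Step 3: pass the retrieved document through the result filter chain.
        for (j = 0; j < resultFilters.length; j++) {
            canvas = resultFilters[j](canvas);
        }
        // Step 4: the filtered canvas is what ends up on display.
        canvases[i] = canvas;
    }
    return canvases;
}

var pipelineOutput = runPipeline(
    ["http://example.com/search?q=foo"],
    [{ innerHTML: "" }],   // plain object standing in for a DOM Node
    [function (req) { return req + "&day=1"; }],
    [function (canvas) {
        canvas.innerHTML = canvas.innerHTML.toUpperCase();
        return canvas;
    }],
    function (req) { return "<p>fetched " + req + "</p>"; });
```

Because every filter returns what the next one consumes, filters can be pushed onto either chain in any number without the loop changing.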

The iGMonkey Filter Naming Convention

iGMonkey filters are named as follows: The first letter indicates the filter type and should be lower case. At present, the types are request filters (q, as in qFltrBustCacheDay) and result filters (s, as in sFltrViewHTML). To the filter type, append the string Fltr. Finally, add the name of the filter function, capitalizing the first letter of each word in its name. Thus, sFltrViewHTML is a result filter that allows you to view the HTML source of the input canvas (i.e., the input DOM Node object).

Anatomy of iGMonkey Request Filters

Request filters take a single argument, namely a string that represents a request, for example:
"http://www.google.com/search?hl=en&q=foo&btnG=Google+Search"
The request string may or may not be encoded (see encodeURI and encodeURIComponent). Consequently, request filters should not assume whether or not the input request argument is encoded. I believe this will allow developers greater flexibility in writing request filters, but, if it presents interoperability problems in practice, I will establish a standard convention. Alternatively, I will add support for URI encoding canary traps to iGMonkey.

The internal details of a request filter are up to you. It is almost certain that your request filters will return a modified request string. It is highly advisable that request filters return a valid request string, as this facilitates request filter interoperability and reuse. Regardless, all request filters must return a request string.
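As an illustration, here is a minimal request filter in the spirit of qFltrBustCacheDay. It is a sketch, not the filter that ships with iGMonkey: the name qFltrSketchBustCache, the parameter name bust, and the optional second argument (added only so the example can be tested with a fixed date) are all assumptions.

```javascript
// Hypothetical request filter: append a parameter whose value changes once
// per day. The parameter name "bust" is made up for this sketch.
function qFltrSketchBustCache(request, now) {
    var d = now || new Date();
    var stamp = d.getFullYear() + "-" + (d.getMonth() + 1) + "-" + d.getDate();
    // Start the query string if the request has none, otherwise extend it.
    var separator = (request.indexOf("?") == -1) ? "?" : "&";
    return request + separator + "bust=" + stamp;   // always return a request string
}

var bustedRequest = qFltrSketchBustCache("http://www.google.com/search?q=foo",
                                         new Date(2006, 0, 27));
```

The filter leaves the incoming string untouched apart from the appended parameter, so it works whether or not the request was already encoded.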

Anatomy of iGMonkey Result Filters

Result filters take a single argument, a DOM Node object on which to operate. By convention, this argument is named canvas. Typically, the canvas.innerHTML property is a string that contains the raw HTML text the object represents. Of course, you may also use standard Node interface methods, such as getElementsByTagName and appendChild, to query and manipulate the canvas.

All result filters must return a canvas object. Other than this, how you implement a result filter is up to you. Usually you return the same canvas passed in as an argument. On rare occasions, you may want to return a new or different canvas object. This is possible, but you must make sure to reparent the alternate canvas in order to connect it to the document, and you'll probably also want to deparent the original canvas. Should this situation arise, you may find that the CSS display, visibility, and z-index properties offer a more efficient mechanism to accomplish your desired effects.
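A minimal result filter might look like the sketch below. The filter name sFltrSketchStripScripts is hypothetical, and a plain object with an innerHTML property stands in for the DOM Node so the example runs outside a browser.

```javascript
// Hypothetical result filter: strip <script> elements from the canvas markup.
function sFltrSketchStripScripts(canvas) {
    canvas.innerHTML =
        canvas.innerHTML.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
    return canvas;   // result filters must hand a canvas back
}

// A plain object stands in for a DOM Node in this sketch.
var demoCanvas = { innerHTML: "<p>hello</p><script>alert(1)</script><p>world</p>" };
var filteredCanvas = sFltrSketchStripScripts(demoCanvas);
```

The filter modifies the canvas it was given and returns that same object, which is the usual case described above.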

The iGMonkey Examples

Each example links to the source code of two complete, functional Homepage API modules. The first link is a version of the module that does not take advantage of iGMonkey. The second link is a version that utilizes iGMonkey's retrieval and rendering pipeline, and leverages the utility filters and functions that iGMonkey provides.

The substantive differences between the examples of each approach are found in their respective sFltrGoogleResultsapproach__MODULE_ID__ functions. Each sFltrGoogleResultsapproach__MODULE_ID__ function is an iGMonkey result filter that uses the specified approach to parse Google's search results and format them for display. The annotated filter function sources appear on this page, and differences from the pre-monkey functions are highlighted.

Regular Expression Example

pre-monkey Module Source:
google.com_v_google.cn.regex.pre-im.xml
iGMonkey Module Source:
google.com_v_google.cn.regex.im.1.0.xml

Don't be deceived by sFltrGoogleResultsRegex__MODULE_ID__'s larger line count. The pre-monkey version appears shorter because it packs a lot of its functionality into cryptic, hard-to-modify regular expression one-liners. Nasty, opaque regexes like these are a hallmark of many Perl programs, and they lead haters to derisively refer to Perl as a “read-only language.”

I'm no hater, but I believe iGMonkey's utility functions make the second version clearer and easier to understand and maintain. Furthermore, I know the iGMonkey version is significantly easier to write. Did I mention that iGMonkey's utility functions produce regular expressions that are designed to be more robust?

filter_google_results_regex__MODULE_ID__ Source

 1 function filter_google_results_regex__MODULE_ID__ (html) {
 2     var stanzas = [], header;
 3     var pattern = /<p class=g>(?:.|\s)+?(<p class=g>|<\/div>)/g;
 4 
 5     header = html.match(/<table width=100% border=0 cellpadding=0 cellspacing=0 bgcolor=#e5ecf9>(?:.|\s)+?<\/table>/)[0];
 6     // Push the text `Results 1 - n of about x,xxx,xxx...' into its own row.
 7     // It would be nice to remove the query run time for screen real estate
 8     // reasons.  \(.*\d?\.\d{1,4}.*\)(.*) almost does it, but fails on .cn
 9     header = header.replace(/(<\/td>)(<td.*?>)(<font size=-1>.+)/,
10                             "$1</tr><tr>$2&nbsp;&nbsp;$3");
11     stanzas.push(header);
12 
13     for(var m; m = pattern.exec(html);) {
14         pattern.lastIndex -= m[1].length;
15         stanzas.push(m[0].substring(0, m[0].length - m[1].length));
16     }
17 
18     return "<span style='font-size:0.8em'>" + stanzas.join("") + "</span>";
19     // return filter_view_source__MODULE_ID__(stanzas.join());
20 }

Function that uses regular expressions to parse and reformat Google search results for display

Annotated sFltrGoogleResultsRegex__MODULE_ID__ Source

 1 function sFltrGoogleResultsRegex__MODULE_ID__ (canvas) {
 2     var stanzas = [], header = null, searchResult = null;
 3     
 4     header = canvas.innerHTML.match(new RegExp(
 5         openTagAllAttrsEx("table", [ ["bgcolor", "#e5ecf9"], ["border", "0"],
 6                                      ["cellpadding", "0"], ["cellspacing", "0"],
 7                                      ["width", "100%"] ]) +
 8         anyText + closeTag("table")))[0];
 9 
10     // Push the text `Results 1 - n of about x,xxx,xxx...' into its own row.
11     // It would be nice to remove the query run time for screen real estate
12     // reasons.  \(.*\d?\.\d{1,4}.*\)(.*) almost does it, but fails on .cn
13     header = header.replace(
14         new RegExp(makeSubExp(closeTag("td")) +
15                    makeSubExp(openTagAnyAttrsInc("td")) +
16                    makeSubExp(openTagAllAttrsEx("font", [["size", "-1"]]) +
17                               anyText)),
18         "$1</tr><tr>$2&nbsp;&nbsp;$3");
19     stanzas.push(header);
20 
21     searchResult = new RegExp(openTagAllAttrsEx("p", [["class", "g"]]) +
22                               anyText + closeTag("p"),
23                               "g");
24     for(var m; m = searchResult.exec(canvas.innerHTML);) { stanzas.push(m[0]); }
25 
26     canvas.innerHTML = 
27         "<span style='font-size:0.8em'>" + stanzas.join("") + "</span>";
28 
29     return canvas; // Filter functions *MUST* return the canvas object!
30 }

iGMonkey makes it easy to build regular expressions to parse and reformat Google search results for further filtering or display

Lines 4-8 use the iGMonkey regex utility functions and variables to create a regular expression that matches the header bar from Google's search results page (shown below), and extract it using the JavaScript String object's built-in match method.

Google Search Results Header Bar

5-7
openTagAllAttrsEx is an iGMonkey regex utility function that generates a regular expression to match a specified HTML or XML open tag with all the specified attributes (and values) exclusively (or exactly). iGMonkey also provides openTagAllAttrsInc, i.e. inclusive, openTagAnyAttrsEx, and openTagAnyAttrsInc.
8

anyText is a regex utility variable that holds a regular expression that non-greedily matches any text. Here, it matches the contents between the table open and close tags. anyText requires that some text, which may be whitespace, be present. Use optAnyText if you want a non-greedy match for optional text.

You can probably guess that closeTag generates a regular expression to match a specified HTML or XML close tag.
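To make the idea concrete, here is a rough sketch of how helpers of this kind could be written. These are not iGMonkey's implementations: the sketch-prefixed names are hypothetical, the attribute-matching variants are left out, and [\s\S] is swapped in as the "any character" pattern.

```javascript
// Hypothetical, simplified versions of the helpers; the "sketch" prefix keeps
// them apart from iGMonkey's own functions.
var sketchAnyText = "[\\s\\S]+?";   // one or more characters, non-greedy

// Pattern source for a close tag such as </table>.
function sketchCloseTag(name) {
    return "<\\/" + name + "\\s*>";
}

// Pattern source for an open tag, whatever attributes it carries (or none).
function sketchOpenTagAnyAttrs(name) {
    return "<" + name + "(?:\\s[^>]*)?>";
}

var tableRe = new RegExp(sketchOpenTagAnyAttrs("table") + sketchAnyText +
                         sketchCloseTag("table"));
var samplePage =
    "<div><table width=100% bgcolor=#e5ecf9><tr><td>Web</td></tr></table></div>";
var headerMatch = samplePage.match(tableRe)[0];
```

Each helper returns a pattern source string rather than a RegExp object, so the pieces can be concatenated before a single RegExp is constructed, as the filter above does.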

As the comment on lines 10-12 indicates, lines 13-18 move some of the text in the header bar into a separate row, thus splitting the header bar into two rows. Line 19 pushes the modified header onto an array where the preliminary parsing and reformatting results are stored.

14-17

makeSubExp encloses its argument in parentheses, thus creating a subexpression that you may refer to later. Subexpressions are numbered from left to right, counting from 1.

Notice the openTagAnyAttrsInc call with only one argument, a tag name, on line 15. The call omits the second argument, normally an array of attribute/value arrays. This will match the specified tag regardless of what attributes it has, including when it has no attributes.

18
This is the replacement expression. The text that each numbered subexpression matches replaces the corresponding $n placeholder.
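The interplay between numbered subexpressions and $n placeholders can be tried in isolation. In this sketch, sketchMakeSubExp is a hypothetical one-line equivalent of makeSubExp, and the header row is made-up sample markup, not Google's actual output.

```javascript
// Hypothetical equivalent of makeSubExp: wrap a pattern in a capturing group.
function sketchMakeSubExp(pattern) {
    return "(" + pattern + ")";
}

var rowSplit = new RegExp(sketchMakeSubExp("<\\/td>") +           // $1
                          sketchMakeSubExp("<td[^>]*>") +         // $2
                          sketchMakeSubExp("<font size=-1>.+"));  // $3
var headerRow = "<tr><td>Web</td><td align=right>" +
                "<font size=-1>Results 1 - 10</font></td></tr>";

// Close the first row after $1 and open a new row before $2.
var splitRow = headerRow.replace(rowSplit, "$1</tr><tr>$2&nbsp;&nbsp;$3");
```

The text captured by each group is copied into the output unchanged; only the literal markup between the placeholders is new.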

Lines 21-24 parse out each individual search result and push it onto the preliminary results array. All the iGMonkey regex utilities used in these lines were already described. Note that I pass the g (global operation) flag to the RegExp constructor, and call the RegExp object's exec method repeatedly to obtain each search result.
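The repeated exec idiom relies on standard RegExp behavior: with the g flag set, each call resumes where the previous match ended, and exec returns null once no matches remain. A small standalone sketch, using made-up sample markup:

```javascript
var resultRe = /<p class=g>[\s\S]+?<\/p>/g;
var resultsHtml = "<div><p class=g>first</p><p class=g>second</p></div>";
var found = [];

// Each exec call starts at resultRe.lastIndex, which the previous match
// advanced; the loop ends when exec returns null.
for (var m; (m = resultRe.exec(resultsHtml)); ) {
    found.push(m[0]);
}
```

Without the g flag, exec would return the first match on every call and the loop would never terminate.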

Lines 26-28 reassemble the contents of the preliminary results array, wrap them in a tag that reduces the font to a size suitable for display in a Homepage module, and assign the resulting string back to the canvas's innerHTML property.

Finally, line 29 returns the modified canvas object. As the comment indicates, filter functions must return a canvas. If you forget to do this, it will cause strange, hard-to-find bugs.

Document Object Model Example

pre-monkey Module Source:
google.com_v_google.cn.dom.pre-im.xml
iGMonkey Module Source:
google.com_v_google.cn.dom.im.1.0.xml

As you can see, the following DOM examples are slightly longer than the regular expression examples. Fortunately, the DOM approach has benefits over regular expressions that offset the drawback of slightly more verbose code. The DOM API provides a rich interface for selecting and manipulating the elements of an HTML or XML document, and iGMonkey's DOM utilities aim to enhance this by automating common tasks. Furthermore, DOM applications tend to be more robust than regular expression applications. Choosing the Right Approach explains why.

sFltrGoogleResultsDOM__MODULE_ID__ is quite similar to the pre-monkey version, but it uses iGMonkey utilities to make the code a little clearer.

filter_google_results_dom__MODULE_ID__ Source

 1 function filter_google_results_dom__MODULE_ID__ (canvas) {
 2     var frag = document.createDocumentFragment();
 3     var divs = canvas.getElementsByTagName("div");
 4     var tables = canvas.getElementsByTagName("table");
 5     var stanzas = [], i = 0;
 6 
 7     // Extract Google Search Results Header Bar
 8     for (i = 0; i < tables.length; i++) {
 9         if (("#e5ecf9" == tables[i].bgColor) && ("100%" == tables[i].width)) {
10             var tds = tables[i].getElementsByTagName("td");
11             var container = tds[1].parentNode.parentNode;
12             tds[1].style.paddingLeft = "0.5em";
13             container.appendChild(document.createElement("tr"));
14             container.lastChild.appendChild(tds[1]); // Wherever you go, there you are
15             frag.appendChild(tables[i]);
16             break;
17         }
18     }
19 
20     // Extract Search Results
21     for (i = 0; i < divs.length; i++) {
22         var pars = divs[i].getElementsByTagName("p");
23         if (pars.length > stanzas.length) { stanzas = pars; }
24     }
25 
26     frag.appendChild(document.createElement("span"));
27     frag.lastChild.style.fontSize = "0.8em";
28     while (stanzas.length > 0) { frag.lastChild.appendChild(stanzas[0]); }
29 
30     canvas.innerHTML = "";
31     canvas.appendChild(frag);
32 }

Function that uses the DOM API to parse and reformat Google search results for display

Annotated sFltrGoogleResultsDOM__MODULE_ID__ Source

 1 function sFltrGoogleResultsDOM__MODULE_ID__ (canvas) {
 2     var frag = document.createDocumentFragment();
 3     var divs = canvas.getElementsByTagName("div");
 4     var tables = canvas.getElementsByTagName("table");
 5     var stanzas = [], i = 0;
 6 
 7     // Extract Google Search Results Header Bar
 8     for (i = 0; i < tables.length; i++) {
 9         if (hasAllAttrsInc(tables[i],
10                            [["bgcolor", "#e5ecf9"], ["width", "100%"]])) {
11             var tds = tables[i].getElementsByTagName("td");
12             var container = tds[1].parentNode.parentNode;
13             tds[1].style.paddingLeft = "0.5em";
14             container.appendChild(document.createElement("tr"));
15             container.lastChild.appendChild(tds[1]); // Wherever you go, there you are
16             frag.appendChild(tables[i]);
17             break;
18         }
19     }
20 
21     // Extract Search Results
22     foreach(divs, function (div) {
23         var pars = div.getElementsByTagName("p");
24         if (pars.length > stanzas.length) { stanzas = pars; }        
25     });
26 
27     frag.appendChild(document.createElement("span"));
28     frag.lastChild.style.fontSize = "0.8em";
29     while (stanzas.length > 0) { frag.lastChild.appendChild(stanzas[0]); }
30 
31     canvas.innerHTML = "";
32     canvas.appendChild(frag);
33     return canvas; // Filter functions *MUST* return the canvas object!
34 }

iGMonkey filter that uses the DOM API to parse and reformat Google search results for further filtering or display

Lines 8-19 loop through all the tables in the document to find the Google search results header bar (shown below). The header bar is then split into two rows, and appended to a DocumentFragment that holds the preliminary parsing and reformatting results. Once this is complete, the code breaks out of the loop. Notably, this code does not use iGMonkey's foreach function because this would require a labeled break statement. At the time of this writing, labeled break statements did not appear to perform properly in some browsers.

Google Search Results Header Bar

9,10

This hasAllAttrsInc call returns true when it finds the search results header bar. hasAllAttrsInc is an iGMonkey DOM utility function that tests if the given Node has all the specified attributes (and values) inclusively. That is, the Node must contain all the specified attributes and attribute/value pairs, and it may contain additional attributes.

hasAllAttrsInc offers two benefits over the test clause on line 9 of the pre-monkey version. First, it is more readable than a series of test clauses joined together with the && operator, particularly when there are many clauses. Second, it avoids subtle bugs that arise from differences between the capitalization conventions JavaScript uses for property names and those HTML and XML use for attribute names, cf. bgColor vs. bgcolor.
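A check of this kind can be sketched with the standard getAttribute method, which takes the attribute name as it is written in the markup. The function below is a hypothetical re-implementation of the idea, not iGMonkey's source, and fakeElement is a stand-in that lets the sketch run outside a browser.

```javascript
// Hypothetical sketch of an inclusive attribute test.
// attrs is an array of [name, value] pairs, as in the filter above.
function sketchHasAllAttrsInc(node, attrs) {
    for (var i = 0; i < attrs.length; i++) {
        // Look the attribute up by its markup name (bgcolor, not bgColor).
        if (node.getAttribute(attrs[i][0]) !== attrs[i][1]) { return false; }
    }
    return true;   // extra attributes on the node are allowed
}

// Minimal stand-in for a DOM element: only getAttribute is provided.
function fakeElement(attributes) {
    return {
        getAttribute: function (name) {
            return attributes.hasOwnProperty(name) ? attributes[name] : null;
        }
    };
}

var headerTable = fakeElement({ bgcolor: "#e5ecf9", width: "100%", border: "0" });
var plainTable = fakeElement({ width: "100%" });
var wanted = [["bgcolor", "#e5ecf9"], ["width", "100%"]];
var headerIsMatch = sketchHasAllAttrsInc(headerTable, wanted);
var plainIsMatch = sketchHasAllAttrsInc(plainTable, wanted);
```

headerTable passes even though it carries an extra border attribute, which is what "inclusive" means here; plainTable fails because it lacks bgcolor.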

11-15
This code creates a new table row (tr) Node, adds it to the header bar table, and moves the first row's second table data (td) Node into the new row.
16
Finally, the header bar table is appended to frag, a DocumentFragment object. Since frag is not part of the document, this removes the header bar from the document.

Lines 22-25 find the search results. Each search result in Google's search results page is a paragraph (p) element. This code takes advantage of the fact that the results are enclosed in a div element. Furthermore, the particular div that contains the results contains more paragraph elements than any other div in the page.

22-25
iGMonkey's foreach function is a higher order procedure that executes a given function for each element in an array. To understand this foreach loop, compare it to the for loop on lines 21-24 of the pre-monkey version. Both loops accomplish the same task.
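A helper of this shape takes only a few lines. The version below is a hypothetical sketch rather than iGMonkey's code, and nested arrays of strings stand in for the lists of paragraph Nodes so that it runs outside a browser.

```javascript
// Hypothetical sketch of a foreach helper. It accepts any array-like object
// with a length property, such as the list getElementsByTagName returns.
function sketchForeach(list, fn) {
    for (var i = 0; i < list.length; i++) {
        fn(list[i], i);
    }
}

// Same selection logic as the filter: keep the longest list seen so far.
var candidateLists = [["p1"], ["p1", "p2", "p3"], ["p1", "p2"]];
var longest = [];
sketchForeach(candidateLists, function (pars) {
    if (pars.length > longest.length) { longest = pars; }
});
```

Because the loop body is a function, an ordinary break statement cannot end the iteration early, which fits a search like this one that has to examine every element anyway.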

Line 31 clears the canvas by setting canvas.innerHTML to the empty string. In the following line, 32, the reformatted results are drawn on the canvas by appending the DocumentFragment object to the canvas.

The last line of sFltrGoogleResultsDOM__MODULE_ID__, line 33, returns the modified canvas object. As the comment indicates, filter functions must return a canvas. To reiterate the warning from the regex example, failing to do this will cause strange, hard-to-find bugs.

NXSL (XPath) Example

Examples Coming Soon!

NXSL is an implementation of the XML Path Language (XPath) specification.

Remote Procedure Call {SOAP, XML-RPC, REST} Example

Examples Coming Eventually!

The three most common remote procedure call (RPC) mechanisms available on the web today are SOAP, XML-RPC, and REST.

Choosing the Right Approach

Deciding which of the four approaches described above (regular expressions, the DOM API, the NXSL API, and remote procedure call mechanisms) works best depends on your needs and the data you want to manipulate. There are no hard and fast rules, but here are my thoughts on the pros and cons of each: