[Screenshots: The Application | Innocuous Query | Censored Query]
Google.cn, a local version of the Google search engine for China, launched in January 2006 and sparked a flurry of arguments over the relative ‘evilness’ of this act. The crux of the matter was that the Chinese government refused to allow Google.cn to operate unless it agreed to censor specific search results at the government's request.
Unfortunately, informed debate is not possible without good information, and, at the time of this writing, Google and the Chinese government are the only parties that know what results Google.cn censors. Happily, the Google Homepage API makes it possible to inform this debate using Google's own technology!
The example application is a mashup that provides a simple way for anyone to query both Google.com and Google.cn simultaneously and compare the results. The Homepage API makes it possible. iGMonkey makes it easy.
Click to add the iGMonkey example application to your Google Personalized Homepage. You may also add it manually by following these steps:
http://www.csua.berkeley.edu/~dans/google/mashups/gvg/google.com_v_google.cn.xml
Try out the iGMonkey example application that now appears on your personalized homepage.
Each example links to the source code of two complete, functional Homepage API modules. The first link is a version of the module that does not take advantage of iGMonkey. The second link is a version that uses iGMonkey's retrieval and rendering pipeline and leverages the utility filters and functions that iGMonkey provides.
The substantive differences between the examples of each approach are found in their respective sFltrGoogleResultsapproach__MODULE_ID__ functions, where approach stands for the name of the approach (Regex or DOM in the examples on this page). Each sFltrGoogleResultsapproach__MODULE_ID__ function is an iGMonkey result filter that uses the specified approach to parse Google's search results and format them for display. The annotated filter function sources appear on this page, and differences from the pre-monkey functions are highlighted.
Don't be deceived by sFltrGoogleResultsRegex__MODULE_ID__'s larger line count. The pre-monkey version appears shorter because it packs a lot of its functionality into cryptic, hard-to-modify regular expression one-liners. Nasty, opaque regexes like these are a hallmark of many Perl programs, and they lead haters to derisively refer to Perl as a “read-only language.”
I'm no hater, but I believe iGMonkey's utility functions make the second version clearer and easier to understand and maintain. Furthermore, I know the iGMonkey version is significantly easier to write. Did I mention that iGMonkey's utility functions produce regular expressions that are designed to be more robust?
filter_google_results_regex__MODULE_ID__ Source
function filter_google_results_regex__MODULE_ID__ (html) {
  var stanzas = [], header;
  // Matches one search result, up to the start of the next result or the
  // closing div that ends the result list.
  var pattern = /<p class=g>(?:.|\s)+?(<p class=g>|<\/div>)/g;

  header = html.match(/<table width=100% border=0 cellpadding=0 cellspacing=0 bgcolor=#e5ecf9>(?:.|\s)+?<\/table>/)[0];
  // Push the text `Results 1 - n of about x,xxx,xxx...' into its own row.
  // It would be nice to remove the query run time for screen real estate
  // reasons. \(.*\d?\.\d{1,4}.*\)(.*) almost does it, but fails on .cn
  header = header.replace(/(<\/td>)(<td.*?>)(<font size=-1>.+)/,
                          "$1</tr><tr>$2 $3");
  stanzas.push(header);

  for (var m; m = pattern.exec(html);) {
    // Rewind past the terminator so the next match can start with it.
    pattern.lastIndex -= m[1].length;
    stanzas.push(m[0].substring(0, m[0].length - m[1].length));
  }

  return "<span style='font-size:0.8em'>" + stanzas.join("") + "</span>";
  // return filter_view_source__MODULE_ID__(stanzas.join());
}
Function that uses regular expressions to parse and reformat Google search results for display
sFltrGoogleResultsRegex__MODULE_ID__ Source
 1  function sFltrGoogleResultsRegex__MODULE_ID__ (canvas) {
 2    var stanzas = [], header = null, searchResult = null;
 3
 4    header = canvas.innerHTML.match(new RegExp(
 5      openTagAllAttrsEx("table", [ ["bgcolor", "#e5ecf9"], ["border", "0"],
 6                                   ["cellpadding", "0"], ["cellspacing", "0"],
 7                                   ["width", "100%"] ]) +
 8      anyText + closeTag("table")))[0];
 9
10    // Push the text `Results 1 - n of about x,xxx,xxx...' into its own row.
11    // It would be nice to remove the query run time for screen real estate
12    // reasons. \(.*\d?\.\d{1,4}.*\)(.*) almost does it, but fails on .cn
13    header = header.replace(
14      new RegExp(makeSubExp(closeTag("td")) +
15                 makeSubExp(openTagAnyAttrsInc("td")) +
16                 makeSubExp(openTagAllAttrsEx("font", [["size", "-1"]]) +
17                            anyText)),
18      "$1</tr><tr>$2 $3");
19    stanzas.push(header);
20
21    searchResult = new RegExp(openTagAllAttrsEx("p", [["class", "g"]]) +
22                              anyText + closeTag("p"),
23                              "g");
24    for (var m; m = searchResult.exec(canvas.innerHTML);) { stanzas.push(m[0]); }
25
26    canvas.innerHTML =
27      "<span style='font-size:0.8em'>" + stanzas.join("") + "</span>";
28
29    return canvas; // Filter functions *MUST* return the canvas object!
30  }
iGMonkey makes it easy to build regular expressions to parse and reformat Google search results for further filtering or display
Lines 4-8 use the iGMonkey regex utility functions and variables to create a regular expression that matches the header bar from Google's search results page (shown below), and extract it using the JavaScript String object's built-in match method.
Google Search Results Header Bar
openTagAllAttrsEx is an iGMonkey regex utility function that generates a regular expression to match a specified HTML or XML open tag with all the specified attributes (and values) exclusively (or exactly). iGMonkey also provides openTagAllAttrsInc, i.e. inclusive, openTagAnyAttrsEx, and openTagAnyAttrsInc.
anyText is a regex utility variable that holds a regular expression that will non-greedy match any text. Here, it matches the contents between the table open and close tags. anyText requires that some text, including whitespace, be present. Use optAnyText if you want a non-greedy match for some optional text.
You can probably guess that closeTag generates a regular expression to match a specified HTML or XML close tag.
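To make the idea concrete, here is a minimal sketch of how helpers of this kind could build regular expression source strings. It is not iGMonkey's actual implementation: the names escapeRegex, anyTextSketch, closeTagSketch, and openTagAllAttrsSketch are hypothetical, the sample markup is made up, and the open tag helper only implements inclusive matching (every listed attribute must be present, in any order, with or without quotes).

```javascript
// Hypothetical stand-ins for iGMonkey-style regex helpers (assumed, not the real source).

// Escape characters that have a special meaning inside a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Non-greedy match for one or more characters, including newlines.
var anyTextSketch = "[\\s\\S]+?";

// Source for a close tag such as </table>, allowing whitespace before ">".
function closeTagSketch(name) {
  return "<\\/" + escapeRegex(name) + "\\s*>";
}

// Source for an open tag that carries every listed attribute/value pair,
// in any order, with or without quotes around the value. Each pair becomes
// a lookahead, so extra attributes are tolerated.
function openTagAllAttrsSketch(name, attrs) {
  var lookaheads = (attrs || []).map(function (pair) {
    return "(?=[^>]*\\s" + escapeRegex(pair[0]) + "\\s*=\\s*[\"']?" +
           escapeRegex(pair[1]) + "[\"']?(?=[\\s>]))";
  });
  return "<" + escapeRegex(name) + "\\b" + lookaheads.join("") + "[^>]*>";
}

// Compose the pieces the same way the filter composes its header bar pattern.
var tableRegex = new RegExp(
  openTagAllAttrsSketch("table", [["bgcolor", "#e5ecf9"], ["width", "100%"]]) +
  anyTextSketch + closeTagSketch("table"), "i");

// Made-up markup for illustration only.
var html = "<div><table width=100% border=0 bgcolor=#e5ecf9>" +
           "<tr><td>Web</td></tr></table></div>";
var headerBar = html.match(tableRegex)[0];
```

Building the pattern from small named pieces is what lets the calling code say what it wants to match instead of spelling out every character of the tag.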
As the comment on lines 10-12 indicates, lines 13-18 move some of the text in the header bar into a separate row, thus splitting the header bar into two rows. Line 19 pushes the modified header onto an array where the preliminary parsing and reformatting results are stored.
makeSubExp encloses its argument in parentheses, thus creating a subexpression that you may refer to later. Subexpressions are numbered from left to right, counting from 1.
Notice the openTagAnyAttrsInc call with only one argument, a tag name, on line 15. The call omits the second argument, normally an array of attribute/value arrays. This will match the specified tag regardless of what attributes it has, including when it has no attributes.
The replacement string on line 18 refers back to each subexpression with the $n placeholder.
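The following sketch shows the same subexpression technique with a plain JavaScript regular expression literal, so it runs without iGMonkey. The two-cell header row is made-up markup for illustration, not Google's actual output.

```javascript
// A made-up header row with two cells (illustrative markup only).
var header = "<tr><td>Web</td><td align=right>" +
             "<font size=-1>Results 1 - 10</font></td></tr>";

// Three parenthesized subexpressions, numbered left to right from 1:
//   $1 = the first cell's close tag
//   $2 = the second cell's open tag, with whatever attributes it has
//   $3 = the font tag and the text that follows it
var splitRow = /(<\/td>)(<td[^>]*>)(<font size=-1>.+)/;

// The replacement string re-inserts each subexpression via $1, $2, and $3,
// closing the first row and opening a second one between the two cells.
var twoRows = header.replace(splitRow, "$1</tr><tr>$2 $3");
```

Because the replacement only adds text between the captured pieces, nothing from the original header is lost; it is simply regrouped into two rows.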
Lines 21-24 parse out each individual search result and push it onto the preliminary results array. All the iGMonkey regex utilities used in these lines were already described. Note that I pass the g (global operation) flag to the RegExp constructor, and call the RegExp object's exec method repeatedly to obtain each search result.
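Here is a small, self-contained sketch of that exec loop, using a regular expression literal and made-up markup in place of the iGMonkey helpers and a real results page. Note that without the g flag, exec would return the first match every time and the loop would never end.

```javascript
// Illustrative markup; real result pages are far larger.
var html = "<div><p class=g>First result</p><p class=g>Second result</p></div>";

// The g flag makes exec() resume from lastIndex instead of restarting at 0.
var searchResult = /<p class=g>[\s\S]+?<\/p>/g;

var stanzas = [];
var m;
while ((m = searchResult.exec(html)) !== null) {
  stanzas.push(m[0]); // m[0] is the full text of the current match
}
// Once no further match is found, exec() returns null and resets lastIndex to 0.
```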
Lines 26-28 reassemble the contents of the preliminary results array, wrap them all up with a tag that reduces the font to a size suitable for display in a Homepage module, and assign the resulting string back to the canvas's innerHTML property. Finally, line 29 returns the modified canvas object. As the comment indicates, filter functions must return a canvas. If you forget to do this, it will cause strange, hard-to-find bugs.
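Why a missing return value is so troublesome becomes clearer with a sketch of how a pipeline could chain filters. This is an assumption about the general pattern, not iGMonkey's actual code: runFiltersSketch and both filters are hypothetical, and a plain object stands in for the canvas.

```javascript
// Hypothetical pipeline: each filter receives the previous filter's return value.
function runFiltersSketch(canvas, filters) {
  return filters.reduce(function (current, filter) {
    return filter(current);
  }, canvas);
}

// A well-behaved filter: modifies the canvas and returns it.
function shrinkText(canvas) {
  canvas.innerHTML =
    "<span style='font-size:0.8em'>" + canvas.innerHTML + "</span>";
  return canvas;
}

// A buggy filter: does its work but forgets to return the canvas.
function forgetfulFilter(canvas) {
  canvas.innerHTML = canvas.innerHTML.toUpperCase();
}

// Plain objects stand in for the real canvas element.
var good = runFiltersSketch({ innerHTML: "results" }, [shrinkText]);

// The pipeline hands back undefined here; any later filter would receive
// undefined instead of a canvas and fail far from the actual mistake.
var broken = runFiltersSketch({ innerHTML: "results" }, [forgetfulFilter]);
```

In a chain like this, the error surfaces in whichever filter runs next rather than in the one that forgot its return statement, which is what makes the bug hard to trace.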
As you can see, the following DOM examples are slightly longer than the regular expression examples. Fortunately, the DOM approach has benefits over regular expressions that offset the drawback of slightly more verbose code. The DOM API provides a rich interface for selecting and manipulating the elements of an HTML or XML document, and iGMonkey's DOM utilities aim to enhance this by automating common tasks. Furthermore, DOM applications tend to be more robust than regular expression applications. Choosing the Right Approach explains why.
sFltrGoogleResultsDOM__MODULE_ID__ is quite similar to the pre-monkey version, but it uses iGMonkey utilities to make the code a little clearer.
filter_google_results_dom__MODULE_ID__ Source
 1  function filter_google_results_dom__MODULE_ID__ (canvas) {
 2    var frag = document.createDocumentFragment();
 3    var divs = canvas.getElementsByTagName("div");
 4    var tables = canvas.getElementsByTagName("table");
 5    var stanzas = [], i = 0;
 6
 7    // Extract Google Search Results Header Bar
 8    for (i = 0; i < tables.length; i++) {
 9      if (("#e5ecf9" == tables[i].bgColor) && ("100%" == tables[i].width)) {
10        var tds = tables[i].getElementsByTagName("td");
11        var container = tds[1].parentNode.parentNode;
12        tds[1].style.paddingLeft = "0.5em";
13        container.appendChild(document.createElement("tr"));
14        container.lastChild.appendChild(tds[1]); // Wherever you go, there you are
15        frag.appendChild(tables[i]);
16        break;
17      }
18    }
19
20    // Extract Search Results
21    for (i = 0; i < divs.length; i++) {
22      var pars = divs[i].getElementsByTagName("p");
23      if (pars.length > stanzas.length) { stanzas = pars; }
24    }
25
26    frag.appendChild(document.createElement("span"));
27    frag.lastChild.style.fontSize = "0.8em";
28    while (stanzas.length > 0) { frag.lastChild.appendChild(stanzas[0]); }
29
30    canvas.innerHTML = "";
31    canvas.appendChild(frag);
32  }
Function that uses the DOM API to parse and reformat Google search results for display
sFltrGoogleResultsDOM__MODULE_ID__ Source
 1  function sFltrGoogleResultsDOM__MODULE_ID__ (canvas) {
 2    var frag = document.createDocumentFragment();
 3    var divs = canvas.getElementsByTagName("div");
 4    var tables = canvas.getElementsByTagName("table");
 5    var stanzas = [], i = 0;
 6
 7    // Extract Google Search Results Header Bar
 8    for (i = 0; i < tables.length; i++) {
 9      if (hasAllAttrsInc(tables[i],
10                         [["bgcolor", "#e5ecf9"], ["width", "100%"]])) {
11        var tds = tables[i].getElementsByTagName("td");
12        var container = tds[1].parentNode.parentNode;
13        tds[1].style.paddingLeft = "0.5em";
14        container.appendChild(document.createElement("tr"));
15        container.lastChild.appendChild(tds[1]); // Wherever you go, there you are
16        frag.appendChild(tables[i]);
17        break;
18      }
19    }
20
21    // Extract Search Results
22    foreach(divs, function (div) {
23      var pars = div.getElementsByTagName("p");
24      if (pars.length > stanzas.length) { stanzas = pars; }
25    });
26
27    frag.appendChild(document.createElement("span"));
28    frag.lastChild.style.fontSize = "0.8em";
29    while (stanzas.length > 0) { frag.lastChild.appendChild(stanzas[0]); }
30
31    canvas.innerHTML = "";
32    canvas.appendChild(frag);
33    return canvas; // Filter functions *MUST* return the canvas object!
34  }
iGMonkey filter that uses the DOM API to parse and reformat Google search results for further filtering or display
Lines 8-19 loop through all the tables in the document to find the Google search results header bar (shown below). The header bar is then split into two rows, and appended to a DocumentFragment that holds the preliminary parsing and reformatting results. Once this is complete, the code breaks out of the loop. Notably, this code does not use iGMonkey's foreach function because this would require a labeled break statement. At the time of this writing, labeled break statements did not appear to perform properly in some browsers.
Google Search Results Header Bar
This hasAllAttrsInc call returns true when it finds the search results header bar. hasAllAttrsInc is an iGMonkey DOM utility function that tests if the given Node has all the specified attributes (and values) inclusively. That is, the Node must contain all the specified attributes and attribute/value pairs, and it may contain additional attributes.
hasAllAttrsInc offers two benefits over the test clause on line 9 of the pre-monkey version. First, it is more readable than a series of multiple test clauses joined together with the && operator, particularly when there are many clauses. Second, it avoids subtle bugs when referencing attribute names that arise from differences between the capitalization convention used by JavaScript versus that used by HTML and XML, cf. bgColor vs. bgcolor.
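A minimal sketch of such an inclusive attribute test follows, assuming only the standard DOM getAttribute method. The names hasAllAttrsIncSketch and stubElement are hypothetical, and iGMonkey's real function may differ, for example in how it compares values. The stub element exists only so the sketch runs outside a browser.

```javascript
// Hypothetical sketch of an inclusive attribute test (not iGMonkey's source).
// Returns true when the node carries every listed attribute/value pair;
// attributes that are not listed are ignored.
function hasAllAttrsIncSketch(node, attrs) {
  return attrs.every(function (pair) {
    // getAttribute takes the markup attribute name ("bgcolor"), so the
    // JavaScript property spelling ("bgColor") never enters the picture.
    return node.getAttribute(pair[0]) === pair[1];
  });
}

// Minimal stand-in for a DOM element so the sketch runs outside a browser.
function stubElement(attributes) {
  return {
    getAttribute: function (name) {
      var key = name.toLowerCase();
      return Object.prototype.hasOwnProperty.call(attributes, key)
        ? attributes[key]
        : null;
    }
  };
}

var headerBar = stubElement({ bgcolor: "#e5ecf9", width: "100%", border: "0" });
var otherTable = stubElement({ width: "100%" });
var wanted = [["bgcolor", "#e5ecf9"], ["width", "100%"]];

var isHeader = hasAllAttrsIncSketch(headerBar, wanted);  // extra border attribute is fine
var isOther = hasAllAttrsIncSketch(otherTable, wanted);  // bgcolor is missing
```

Keeping the attribute list as data also means a new condition is one more array entry rather than another && clause.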
Lines 14-15 create a new table row (tr) Node, add it to the header bar table, and move the first row's second table data (td) Node into the new row. Line 16 appends the header bar table to frag, a DocumentFragment object. Since frag is not part of the document, this removes the header bar from the document.
Lines 22-25 find the search results. Each search result in Google's search results page is a paragraph (p) element. This code takes advantage of the fact that the results are enclosed in a div element. Furthermore, the particular div that contains the results contains more paragraph elements than any other div in the page.
The foreach function is a higher-order procedure that executes a given function for each element in an array. To understand this foreach loop, compare it to the for loop on lines 21-24 of the pre-monkey version. Both loops accomplish the same task.
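For readers new to higher-order procedures, here is a minimal sketch of what such a helper could look like. The name foreachSketch is hypothetical and iGMonkey's own foreach may be written differently. Plain arrays stand in for the paragraph collections so the sketch runs outside a browser.

```javascript
// Hypothetical sketch of a foreach helper (not iGMonkey's source).
// Works on arrays and array-like objects such as the collections
// returned by getElementsByTagName.
function foreachSketch(list, fn) {
  for (var i = 0; i < list.length; i++) {
    fn(list[i], i);
  }
}

// Plain arrays stand in for each div's paragraph collection.
var divs = [["p"], ["p", "p", "p"], ["p", "p"]];
var stanzas = [];

// Keep the largest collection seen so far, as both versions of the loop do.
foreachSketch(divs, function (pars) {
  if (pars.length > stanzas.length) { stanzas = pars; }
});
```

The loop counter and bounds check live in one place, and the caller supplies only the body that runs for each element.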
Line 31 clears the canvas by setting canvas.innerHTML to the empty string. In the following line, 32, the reformatted results are drawn on the canvas by appending the DocumentFragment object to the canvas.
The last statement in sFltrGoogleResultsDOM__MODULE_ID__, on line 33, returns the modified canvas object. As the comment indicates, filter functions must return a canvas. To reiterate the warning from the regex example, failing to do this will cause strange, hard-to-find bugs.
Examples Coming Soon!
NXSL is an implementation of the XML Path Language (XPath) specification.
Examples Coming Eventually!
The three most common remote procedure call (RPC) mechanisms available on the web today are SOAP, XML-RPC, and REST.
Deciding which of the four approaches described above (regular expressions, the DOM API, the NXSL API, and remote procedure call mechanisms) works best depends on your needs and the data you want to manipulate. There are no hard and fast rules, but here are my thoughts on the pros and cons of each:
Regular expressions are great for rapid prototyping as well as quick and dirty hacks. Of the four methods, however, regular expressions probably lead to the most brittle applications because purely syntactic changes to a document can cause an existing regex to fail. That said, regular expressions are also the most universally applicable method. Thus, if none of the other methods are appropriate, you can always use regexes.
Notably, iGMonkey's regex utility functions generate regular expressions designed with robustness in mind, and are more flexible in the face of certain kinds of syntactic changes.
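As a small illustration of that brittleness, the sketch below runs a literal pattern and a more tolerant one against two renderings of the same paragraph tag. Both patterns and the sample markup are assumptions for illustration; the tolerant pattern is not what iGMonkey generates.

```javascript
// Two syntactically different renderings of the same element.
var unquoted = "<p class=g>A result</p>";
var quoted = '<P CLASS="g" >A result</P>';

// Literal pattern: tied to one exact spelling of the tag.
var brittle = /<p class=g>/;

// More tolerant pattern: ignores case, allows optional quotes around the
// value, and allows extra whitespace or attributes before the closing ">".
var tolerant = /<p\b[^>]*\sclass\s*=\s*["']?g["']?(?=[\s>])[^>]*>/i;

var results = {
  brittleUnquoted: brittle.test(unquoted),
  brittleQuoted: brittle.test(quoted),
  tolerantUnquoted: tolerant.test(unquoted),
  tolerantQuoted: tolerant.test(quoted)
};
```

The two inputs mean the same thing to a browser, yet only the tolerant pattern accepts both, which is the kind of purely syntactic change that breaks hand-written regexes.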
The Document Object Model API may require more effort upfront than regular expressions, but will often pay off with more robust and maintainable applications. Specifically, purely syntactic changes to a document can break a regex application, but cannot break a DOM application. It takes a structural document change to potentially break a DOM application. Additionally, the DOM should work with any HTML or XML document, whereas NXSL will not work with older, non-XML versions of HTML.
iGMonkey's DOM utilities aim to automate some of the more common tasks you will encounter when working with the DOM.
The NXSL API is great if you want to work with pure XML formats like Atom and RSS. Unfortunately, NXSL won't work with most HTML documents on the web today because, as mentioned above, it doesn't support non-XML versions of HTML. An added caveat is that using NXSL requires taking the time to learn XPath.
iGMonkey does not feature any NXSL utility functions at this time, but I will add some if it appears useful to do so.
Remote procedure call mechanisms are the holy grail of data retrieval and manipulation since they make it easy to query precisely the information you want. The good news is that many popular web sites including Amazon.com, Flickr, and Google offer RPC APIs. The bad news is that most smaller sites don't, and many of the sites that do place limitations on the APIs. For example, many sites limit the number and type of RPC queries that an application may make during a given time period. Furthermore, RPC APIs may not expose all the functionality that a particular site offers. If these drawbacks will not impact your application, RPC APIs are arguably the best way to go.
iGMonkey does not include native support for the various RPC mechanisms, but I hope to add this in the near future. I will write this from scratch, if necessary, but if you are aware of any open source SOAP, XML-RPC, REST, or other RPC libraries written in JavaScript that I may incorporate into iGMonkey, please let me know.