XQuery/Page scraping and Yahoo Weather

Background

Yahoo provide a world weather forecast service via a REST API, delivering RSS. It is described in the API documentation.

However, the key to each feed for UK towns is a Yahoo Location ID such as UKXX0953, and there is no service available to convert location names to Yahoo codes. Yahoo does provide alphabetical index pages of locations, which contain links to the feeds themselves.

Yahoo Pipe

This task can be accomplished, up to the extraction of the location ID, by the Yahoo Pipe written by Paul Daniel. However, the inherent instability of HTML markup means that this pipe currently fails.

XQuery

This script takes a location parameter, extracts the first letter of the location, constructs the URL of the Yahoo Weather index page for that letter (for example, the index page for the letter B), and fetches the page via the httpclient module in eXist. The page is not valid XHTML, but the httpclient:get function cleans up the markup so that it is well-formed XML.

HTML page

The page structure can be seen in the tree view.

Next, the script navigates this XML to locate the li element containing the location and extracts the code for that location. Finally, this code is appended to the stem of the RSS URL, creating the URL of the RSS feed for that location, and the script redirects to that URL.

This process can be visualized using a data flow diagram.

(: Stems of the alphabetical index pages and of the RSS feed URL :)
declare variable $yahooIndex := "http://weather.yahoo.com/regional/UKXX";
declare variable $yahooWeather := "http://weather.yahooapis.com/forecastrss?u=c&amp;p=";

let $location := request:get-parameter("location","Bristol")
let $letter := upper-case(substring($location,1,1))
(: the index page for A has no letter suffix :)
let $suffix := if ($letter eq 'A') then '' else concat('_',$letter)
let $index := xs:anyURI(concat($yahooIndex,$suffix,".html"))
(: fetch the page; eXist tidies the HTML into well-formed XML :)
let $page := httpclient:get($index,true(),())
(: find the link whose text is the location :)
let $href := $page//div[@id="yw-regionalloc"]//li/a[. = $location]/@href
(: the location code sits between 'forecast/' and '.html' :)
let $code := substring-before(substring-after($href,'forecast/'),'.html')
let $rss := xs:anyURI(concat($yahooWeather,$code))

return
   response:redirect-to($rss)
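
The selection and extraction steps of the script above can be tried without any network access. The following is a minimal sketch, not part of the original script: the page fragment, the location codes in it and the function name local:code are all invented for illustration, and the real tidied page may be structured differently (for example, it may carry a namespace).

```xquery
xquery version "1.0";

(: Hypothetical stand-in for the tidied index page. The codes are invented
   and the real markup may differ. :)
declare variable $page :=
  <html>
    <body>
      <div id="yw-regionalloc">
        <ul>
          <li><a href="/forecast/UKXX9991.html">Bristol</a></li>
          <li><a href="/forecast/UKXX9992.html">Brixham</a></li>
        </ul>
      </div>
    </body>
  </html>;

(: Select the first link whose text equals the location, then cut the code
   out of its href. Returns the empty sequence when no link matches. :)
declare function local:code($page as element(), $location as xs:string) as xs:string? {
  let $href := ($page//div[@id = "yw-regionalloc"]//li/a[. = $location]/@href)[1]
  return
    if (exists($href))
    then substring-before(substring-after($href, "forecast/"), ".html")
    else ()
};

(: The code for a listed location, then a check that an unlisted one gives nothing :)
(local:code($page, "Bristol"), empty(local:code($page, "Atlantis")))
```

Returning the empty sequence for an unknown location lets a caller report an error rather than redirect to a feed URL with no code in it.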

Notes

Although the index page is not valid XHTML (why not?) and needs tidying, Yahoo have been helpful to the scraper by using ids on the sections. This allows the XPath expression to pick out the relevant section by id, and then select the li containing the location. However, such tagging is not stable, and in fact changed recently from an id of browse to the current yw-regionalloc. Note also that additional work is required because the page for A has a different URL from those for the remaining letters, a feature not easily seen or tested for.
eXist is not ideally suited to this task since the page first has to be stored in the database so that XPath expressions can be executed using the structural index. An in-memory XQuery engine such as Saxon would be expected to perform better on this task. At present the performance is a bit slow, but the new 1.3 release improves this situation.
Extracting the code from the string would be clearer with a regular expression, but XQuery does not provide a simple matching function to extract the matched pattern. An XQuery function which wraps some XSLT to do this is described in analyse-string.
The script uses the eXist function response:redirect-to to redirect the browser to the constructed URL for the RSS feed.
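
One workaround that stays within the standard function library is to call replace with a capture group, guarded by matches so that a non-matching string is not returned unchanged. This is only a sketch under the assumption that the href has the form /forecast/CODE.html; the sample href and its code are invented.

```xquery
xquery version "1.0";

(: Hypothetical href of the form assumed on this page :)
let $href := "/forecast/UKXX9991.html"
(: Capture everything between 'forecast/' and '.html' :)
let $pattern := "^.*forecast/([^./]+)\.html$"
return
  if (matches($href, $pattern))
  then replace($href, $pattern, "$1")  (: keep only the captured code :)
  else ()                              (: no match: return nothing :)
```

The guard matters because replace returns its input untouched when the pattern does not match, which would otherwise pass a whole href on as if it were a code.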

XSLT

For comparison, here is the equivalent XSLT script, using xsl:analyze-string.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:param name="location"/>
    <xsl:variable name="html2xml">
        <xsl:text>http://www.html2xml.nl/Services/html2xml/version1/Html2Xml.asmx/Url2XmlNode?urlAddress=</xsl:text>
    </xsl:variable>
    <xsl:variable name="yahooIndex">
        <xsl:text>http://weather.yahoo.com/regional/UKXX</xsl:text>
    </xsl:variable>
    <xsl:variable name="yahooWeather">
        <xsl:text>http://weather.yahooapis.com/forecastrss?u=c&amp;p=</xsl:text>
    </xsl:variable>
    <xsl:template match="/">
        <xsl:variable name="letter" select="upper-case(substring($location,1,1))"/>
        <xsl:variable name="suffix" select="if($letter eq 'A') then '' else concat('_',$letter)"></xsl:variable>
        <xsl:variable name="page" select="doc(concat ($html2xml,$yahooIndex,$suffix,'.html'))"/>
        <xsl:variable name="href" select="$page//div[@id='yw-regionalloc']//li/a[.= $location]/@href"/>
        <xsl:variable name="code" >
            <xsl:analyze-string select="$href" regex="forecast/(.*)\.html">
                <xsl:matching-substring>
                    <xsl:value-of select="regex-group(1)"/>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:variable name="rssurl" select="concat($yahooWeather,$code)"/>
        <xsl:copy-of select="doc($rssurl)"/>
    </xsl:template>
</xsl:stylesheet>

Bristol Weather - but currently broken

XPL

Another approach is to use XPL developed by Erik Bruchez and Alessandro Vernet at Orbeon to describe the sequence of transformations as a pipeline. Here the pipeline is extended to create a custom HTML page from the RSS feed.

<?xml version="1.0" encoding="UTF-8"?>
<p:pipeline xmlns:p="http://www.cems.uwe.ac.uk/xpl"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"  >
    <p:output id="weatherPage"/>
    <p:processor name="xslt">
        <p:annotation>construct the index page url from the parameter</p:annotation>
        <p:input name="parameter" id="location"/>
        <p:input name="xml">
            <dummy/>
        </p:input>
        <p:input name="xslt">
            <xsl:template match="/">
                <xsl:variable name="letter" select="upper-case(substring($location,1,1))"/>
                <xsl:text>http://weather.yahoo.com/regional/UKXX</xsl:text>
                <xsl:value-of select="if ($letter eq 'A') then '' else concat('_',$letter)"/>
                <xsl:text>.html</xsl:text>
            </xsl:template>
        </p:input>
        <p:output name="result" id="indexUrl"/>
    </p:processor>
    <p:processor name="tidy">
         <p:annotation>tidy the index page</p:annotation>
        <p:input name="url" id="indexUrl"/>
        <p:output name="xhtml" id="indexXhtml"/>
    </p:processor>
    <p:processor name="xslt">
         <p:annotation>parse the index page and construct the URL for the RSS feed</p:annotation>
        <p:input name="xml" id="indexXhtml"/>
        <p:input name="parameter" id="location"/>
        <p:input name="xslt">
            <xsl:template match="/">
                <xsl:variable name="href" select="//div[@id='yw-regionalloc']//li/a[.= $location]/@href"/>
                <xsl:text>http://weather.yahooapis.com/forecastrss?u=c%26p=</xsl:text>
                <xsl:value-of select="substring-before(substring-after($href,'forecast/'),'.html')"
                />
            </xsl:template>
        </p:input>
        <p:output name="result" id="rssUrl"/>
    </p:processor>
    <p:processor name="fetch">
        <p:annotation>fetch the RSS feed</p:annotation>
        <p:input name="url" id="rssUrl"/>
        <p:output name="result" id="RSSFeed"/>
    </p:processor>
    <p:processor name="xslt">
        <p:annotation>Convert RSS to an HTML page</p:annotation>
        <p:input name="xml" id="RSSFeed"/>
        <p:input name="xslt" href="http://www.cems.uwe.ac.uk/xmlwiki/weather/yahooRSS2HTML.xsl"/>
        <p:output name="result" id="weatherPage"/>
    </p:processor>
</p:pipeline>

Given implementations for each of the named processor types, this can be executed (albeit rather slowly in this prototype XQuery processor).

This is a work in progress: at present this XPL engine is only a very simple, partial prototype, and even this simple sequential example is not conformant with the XPL schema (hence the local namespace).
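
To illustrate the idea of running a strictly sequential pipeline, here is a minimal sketch in which each step is a function and each result is fed to the next step. It is not the prototype engine described here: it assumes an XQuery 3.0 processor with higher-order functions, the names local:run and $samplePage are invented, and the tidy step is replaced by a canned fragment with a made-up location code so that nothing is fetched.

```xquery
xquery version "3.0";

declare variable $location := "Bristol";

(: Hypothetical stand-in for the tidied index page; the code is invented :)
declare variable $samplePage :=
  <div id="yw-regionalloc">
    <ul><li><a href="/forecast/UKXX9991.html">Bristol</a></li></ul>
  </div>;

(: Apply the steps in order, passing each result to the next step :)
declare function local:run($steps as function(*)*, $input as item()*) as item()* {
  if (empty($steps))
  then $input
  else local:run(tail($steps), head($steps)($input))
};

let $steps := (
  (: step 1: build the index page URL from the location :)
  function($loc) {
    let $letter := upper-case(substring($loc, 1, 1))
    return concat("http://weather.yahoo.com/regional/UKXX",
                  if ($letter eq "A") then "" else concat("_", $letter),
                  ".html")
  },
  (: step 2: stand-in for fetching and tidying; ignores the URL :)
  function($url) { $samplePage },
  (: step 3: find the link for the location and build the RSS URL :)
  function($xhtml) {
    let $href := ($xhtml//li/a[. = $location]/@href)[1]
    return concat("http://weather.yahooapis.com/forecastrss?u=c&amp;p=",
                  substring-before(substring-after($href, "forecast/"), ".html"))
  }
)
return local:run($steps, $location)
```

A real engine would also have to resolve the named inputs and outputs that connect processors, which this linear chain deliberately leaves out.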

The pipeline can be visualized using GraphViz.

The intention is to generate an additional image map to support linking to the underlying processes, and to support the full XPL language.