XQuery/Tag Cloud
Counting Words
A tag cloud (or weighted list in visual design) is a visual depiction of user-generated tags, or simply the word content of a site, typically used to describe the content of web sites.
One method of creating a tag cloud is to create a list of the words in a document, count the number of occurrences of each word, and depict the more frequently occurring words with a larger font size than the words that occur less frequently.
Counting the total number of words in a text object
To get a feeling for one of the basic techniques, let's examine Jon Robie's code, which takes all of the text nodes in a document, joins them into one string, splits that string into a sequence of "words" (tokenizing on whitespace, common punctuation, or the literal string 'nbsp;' of the nbsp entity), and counts the resulting words:
let $txt := string-join($doc//text(), " ")
return
    count(tokenize($txt, '(\s|[,.!:;]|[n][b][s][p][;])+'))
Note that the string-join() function takes a sequence of strings and returns a single string in which the items are joined by the separator given as its second argument (here a single space).
If you want to see what this routine treats as a "word" in your document, use the following variation.
let $txt := string-join($doc//text(), " ")
let $words := tokenize($txt, '(\s|[,.!:;]|[n][b][s][p][;])+')
return
    <words count="{count($words)}">
        { for $word in $words return <word>{$word}</word> }
    </words>
Another variation is the word-count() function found at xqueryfunctions.com:
declare function local:word-count($arg as xs:string?) as xs:integer {
    count(tokenize($arg, '\W+')[. != ''])
};
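This version splits on the \W+ regular expression, which matches one or more non-word characters such as whitespace and punctuation. The predicate [. != ''] then discards the empty tokens that tokenize() returns when the string starts or ends with a separator.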
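To see why the predicate matters, the following sketch tokenizes a made-up sample string (the text is an assumption for illustration, not taken from a real document) and reports the token count with and without the filter. Because the sample ends with a period, tokenize() returns a trailing empty string.

```xquery
xquery version "1.0";

(: Hypothetical sample text; it ends with a separator on purpose. :)
let $txt := "The quick, brown fox; the lazy dog."
(: Unfiltered tokens: includes one empty string after the final period. :)
let $raw := tokenize($txt, '\W+')
(: Filtered tokens: the empty string is dropped. :)
let $words := $raw[. != '']
return
    <result raw="{count($raw)}" filtered="{count($words)}">
        { for $word in $words return <word>{$word}</word> }
    </result>
```

For this sample the raw count is 8 and the filtered count is 7, so the unfiltered version overcounts by one.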
Counting Keywords
Kurt Cagle suggested the following XQuery for counting keywords:
declare namespace xqwb="http://xquery.wikibooks.org";
(: Count how often each distinct term occurs in the word list. :)
declare function xqwb:word-count($wordlist as element()) as element() {
    <terms>
    {
        for $term in distinct-values($wordlist/term)
        let $term-count := count($wordlist/term[. = $term])
        return
            <term count="{$term-count}">{$term}</term>
    }
    </terms>
};
let $keywords :=
    <keywords>
        <term>red</term>
        <term>green</term>
        <term>red</term>
        <term>blue</term>
        <term>violet</term>
        <term>red</term>
        <term>blue</term>
        <term>blue</term>
        <term>red</term>
        <term>orange</term>
        <term>green</term>
        <term>yellow</term>
        <term>indigo</term>
        <term>red</term>
    </keywords>
let $result := xqwb:word-count($keywords)
return $result
This Returns the Following
<terms>
    <term count="5">red</term>
    <term count="2">green</term>
    <term count="3">blue</term>
    <term count="1">violet</term>
    <term count="1">orange</term>
    <term count="1">yellow</term>
    <term count="1">indigo</term>
</terms>
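The two techniques can be combined to count word frequencies directly from running text. The following is a minimal sketch, not part of the original examples: the sample string is hypothetical and stands in for string-join($doc//text(), " "), and the lower-casing and the sort order are assumptions about what a tag cloud would typically want.

```xquery
xquery version "1.0";

(: Hypothetical sample text standing in for the joined text nodes of a document. :)
let $txt := "Red green red blue. Blue, red; green!"
(: Lower-case first so that "Red" and "red" count as the same word,
   then split on non-word characters and drop empty tokens. :)
let $words := tokenize(lower-case($txt), '\W+')[. != '']
return
    <terms>
    {
        for $term in distinct-values($words)
        let $term-count := count($words[. = $term])
        (: Most frequent words first; ties are broken alphabetically. :)
        order by $term-count descending, $term
        return <term count="{$term-count}">{$term}</term>
    }
    </terms>
```

With this sample, red is counted three times and blue and green twice each.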
Creating a Tag Cloud
From this you can create a tag cloud or word density map, such as the "Popular Tags" page on the Flickr web site.
declare namespace xqwb="http://xquery.wikibooks.org";
declare option exist:serialize "method=xhtml media-type=text/html indent=yes";
(: Count how often each distinct term occurs in the word list. :)
declare function xqwb:word-count($wordlist as element()) as element() {
    <terms>
    {
        for $term in distinct-values($wordlist/term)
        let $term-count := count($wordlist/term[. = $term])
        return
            <term count="{$term-count}">{$term}</term>
    }
    </terms>
};
let $keywords :=
    <keywords>
        <term>red</term>
        <term>green</term>
        <term>red</term>
        <term>blue</term>
        <term>violet</term>
        <term>red</term>
        <term>blue</term>
        <term>blue</term>
        <term>red</term>
        <term>orange</term>
        <term>green</term>
        <term>yellow</term>
        <term>indigo</term>
        <term>red</term>
    </keywords>
let $result := xqwb:word-count($keywords)
let $total := count($keywords/term)
let $scale := 20
return
    <div>
    {
        for $term in $result/term
        (: Font size in percent: the term's share of all keywords, multiplied by $scale. :)
        let $fontSize := round($term/@count div $total * 100 * $scale)
        order by $term
        (: The separating space is built with concat() because boundary whitespace
           next to an enclosed expression may be stripped by the processor. :)
        return <span style="font-size:{$fontSize}%">{concat(string($term), ' ')}</span>
    }
    </div>
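With the sample data above, the formula gives "red" a font size of round(5 div 14 * 100 * 20) = 714%, while terms that occur once get 143%. If you prefer sizes that stay within a fixed range, one common alternative is to map the counts linearly between a minimum and a maximum size. The sketch below illustrates that alternative and is not part of the original example: the $min-size and $max-size values are assumptions, and the term counts are a hypothetical subset shaped like the output of xqwb:word-count().

```xquery
xquery version "1.0";

(: Hypothetical term counts, shaped like the output of xqwb:word-count(). :)
let $result :=
    <terms>
        <term count="5">red</term>
        <term count="2">green</term>
        <term count="3">blue</term>
        <term count="1">violet</term>
    </terms>
(: Assumed bounds for the font size, in percent. :)
let $min-size := 100
let $max-size := 300
let $counts := for $c in $result/term/@count return xs:integer($c)
let $lo := min($counts)
let $hi := max($counts)
return
    <div>
    {
        for $term in $result/term
        let $count := xs:integer($term/@count)
        (: Linear interpolation between the bounds; when every term has the
           same count there is no range to interpolate over, so use the minimum. :)
        let $fontSize :=
            if ($hi = $lo)
            then $min-size
            else round($min-size + ($count - $lo) div ($hi - $lo) * ($max-size - $min-size))
        order by string($term)
        return <span style="font-size:{$fontSize}%">{concat(string($term), ' ')}</span>
    }
    </div>
```

For these counts the sizes come out as 300% for red, 200% for blue, 150% for green, and 100% for violet.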