LinuxBasics.org

The community that helps people to run Linux

Hi everybody,

We had implemented a way to search both the site and the mailing-list archives from the same search box at the top of each page.

While this was neat, it severely slowed down two functions of the site:

The search was really slow, since it had to examine many more pages (one mail to the list → one page in the wiki). Searches also tended to return many low-quality results.

The second slow function was ‘backlinks’, which finds all pages in the wiki that link to the current page. (It probably uses ‘search’ internally.)

Those two functions were so slow that they seemed to be broken.

So we decided to withdraw the searchability of the mailing-list archives, and I removed it from the wiki.

The mail search form can be found on the Mailing list page, in the section ‘The List's Archives’.

Yours,

Stefan


Searchable Archives

This is how I made our Mailman/Pipermail archive searchable through the DokuWiki search box.

Pipermail, the Mailman archiver, generates HTML-pages for all posts. The following script takes these pages, strips away everything that is not needed and puts the result in a .txt file inside a DokuWiki namespace.

#!/bin/bash

# Source (Pipermail archive) and target (DokuWiki namespace) directories
SRC=/usr/local/mailman/archives/public/qna/
DST=/home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive

# Wipe the old .txt files.
# A plain "rm $DST/*" fails because of too many files, so find hands them to xargs.
find "$DST/" -iname "*.txt" -print0 | xargs -0 rm -f

# Copy all posts (file names starting with a digit) from pipermail
# to the target directory:
find "$SRC" \
  -iname "[0-9]*.html" \
  -exec cp {} "$DST" \;

# Remove <HEAD>, links, footer inserted by pipermail
for i in "$DST"/*.html ; do
   out="$DST/$(basename "$i" .html).txt"
   {
      echo "<html>"
      sed -n -e '/<H1>/,/<\/I>/p' \
             -e '/<!--beginarticle-->/,/<!--endarticle-->/p' \
          "$i"
      echo "</html>"
   } > "$out"
done

# Remove HTML-files
find "$DST/" -iname "*.html" -print0 | xargs -0 rm -f

The sed command prints only the lines between <H1> and </I>, and those between <!--beginarticle--> and <!--endarticle-->. These happen to be the information about the post, and the post itself. Thanks to seder's grab bag for their excellent examples.
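To see what the two address ranges select, here is a small self-contained sketch. The sample page is invented for illustration and only imitates the rough shape of a Pipermail post; real pages contain more markup.

```shell
#!/bin/bash
# Build a tiny, made-up stand-in for a Pipermail page (not real Pipermail output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<HTML>
<HEAD><TITLE>[qna] Example subject</TITLE></HEAD>
<BODY>
<H1>[qna] Example subject</H1>
<B>Example Author</B>
<I>Mon Jun 20 22:17:00 2005</I>
<UL><LI>Previous message: example</UL>
<!--beginarticle-->
<PRE>Body of the post.</PRE>
<!--endarticle-->
<HR>
<a href="index.html">More information about the list</a>
</BODY>
</HTML>
EOF

# -n suppresses sed's default output; each /start/,/end/p prints one range of lines.
extracted=$(sed -n -e '/<H1>/,/<\/I>/p' \
                   -e '/<!--beginarticle-->/,/<!--endarticle-->/p' "$sample")
echo "$extracted"
rm -f "$sample"
```

Only the header lines (<H1> through </I>) and the article block are printed; the <HEAD>, the navigation list and the footer link are dropped.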

The <html> and </html> around the extracted lines tell DokuWiki to allow HTML markup.

All .txt files are wiped before the work starts, and all HTML files are removed once they are no longer needed. That way, if a post is removed from the Pipermail archives, it also vanishes from the DokuWiki archive.
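The effect of wiping first can be reproduced in scratch directories. This is only a sketch: the temporary directories stand in for the real archive and namespace paths, the rebuild function is a made-up name, and the sed step is replaced by a plain copy to keep the example short.

```shell
#!/bin/bash
# Scratch directories standing in for the Pipermail archive and the wiki namespace
# (hypothetical stand-ins, not the real paths).
src=$(mktemp -d)
dst=$(mktemp -d)

rebuild() {
    # Wipe the old .txt files first, then regenerate one per source page.
    find "$dst" -name '*.txt' -delete
    for page in "$src"/*.html; do
        [ -e "$page" ] || continue        # nothing to do if there are no posts
        cp "$page" "$dst/$(basename "$page" .html).txt"
    done
}

touch "$src/000001.html" "$src/000002.html"
rebuild
first_run=$(ls "$dst" | xargs)

rm "$src/000002.html"                     # a post is removed from the archive
rebuild
second_run=$(ls "$dst" | xargs)

echo "after first run:  $first_run"
echo "after second run: $second_run"
rm -rf "$src" "$dst"
```

Without the wipe at the start of rebuild, 000002.txt would survive the second run and stay searchable in the wiki.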

A regular search for ‘network’ takes about 8 seconds, plus about the same time to display all results. ‘WiFi’, which has far fewer results, takes about the same time to search, but much less to display. The time to display may vary with your connection speed.

It is NOT ADVISABLE to search for terms that appear in many posts, like ‘Anita’. Also, if you click on ‘index’ within the qnaarchive namespace, you can easily go for a coffee (or a pizza)…

Of course, don’t forget to set up a cron job that refreshes the archive :)
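As a sketch of what such a cron job could look like: the script path /usr/local/bin/refresh-qnaarchive.sh and the 03:15 schedule are assumptions made for this example, not the values used on LinuxBasics.org.

```shell
#!/bin/bash
# Hypothetical location of the refresh script; adjust to wherever you saved it.
script=/usr/local/bin/refresh-qnaarchive.sh

# Crontab fields: minute hour day-of-month month day-of-week command.
# "15 3 * * *" means every day at 03:15.
cron_line="15 3 * * * $script"
echo "$cron_line"

# To install it for the current user, append the line to the existing crontab:
#   ( crontab -l 2>/dev/null; echo "$cron_line" ) | crontab -
```

Running the refresh at night keeps the wipe-and-rebuild out of the way of most visitors; how often to run it depends on how busy the list is.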

Stefan Waidele jun. 2005/06/20 22:17


Copyright (c) by the authors.
Prior to editing, authors agreed to license their contributions by the terms of the GPL.
See our licensing page for details.


Linux® is a registered trademark of Linus Torvalds.


 
  making/searchable_archives.txt · Last modified: 2008/07/20 21:08
