First Steps with Xapian: hello xapian
Directory layout:
~/helloxapian
~/helloxapian/indexfiles.py
~/helloxapian/search.py
~/helloxapian/test
~/helloxapian/test/hello.txt
~/helloxapian/test/world.txt
~/helloxapian/test/abc.txt
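As a quick sanity check of this layout, the sketch below lists the .txt files directly inside a directory, which is the same selection rule indexfiles.py uses when it looks at file names. The helper name list_text_files is hypothetical and is not part of the original scripts.

```python
import os


def list_text_files(directory):
    """Return the sorted paths of the .txt files directly inside directory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith('.txt')  # same test as file[-4:] == '.txt'
    )
```

Calling list_text_files on the test directory from ~/helloxapian should return the paths of abc.txt, hello.txt and world.txt, in that order.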
Contents of hello.txt:
Welcome to the Xapian project website. Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, and Ruby (so far!) Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators. If you're after a packaged search engine for your website, you should take a look at Omega: an application we supply built upon Xapian. Unlike most other website search solutions, Xapian's versatility allows you to extend Omega to meet your needs as they grow.
Contents of world.txt:
The 0.9 branch features a few API changes, the most notable being a rewritten QueryParser which is reentrant, has encapsulated internals, and parses better than the old one. Note that the examples are now a subdirectory of xapian-core, so there is no longer a separate xapian-examples download (most of the size of the xapian-examples download was due to configure and other generated files!)
Contents of abc.txt:
Documentation A number of pieces of documentation are available. We suggest you start by reading the Installation Guide, which covers downloading the code, and unpacking, configuring, building and installing. It then shows how to build the example programs. For a quick introduction to our software, including a walk-through example of an application for searching through some data, read the Quickstart document. The Overview explains the API which Xapian provides to programmers. Much useful documentation is automatically extracted from the source code. Full documentation of the API is available for users. For those wishing to do development work on the Xapian library itself, documentation of the internals is available, and there's a short document outlining the directory structure which is automatically generated from the source code.
The indexfiles.py file:
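#!/usr/bin/env python
# coding=utf-8
import re
import sys
from os import listdir

import xapian

# Simple tokenizer for English text; Chinese is not handled yet.
rex = re.compile('[a-zA-Z0-9]+')
MAX_TERM_LENGTH = 64  # maximum length of a single term
DBPATH = 'indexdb'    # directory that holds the index

if len(sys.argv) < 2:
    sys.stderr.write("Missing argument: please give the directory to index\n")
    sys.exit(1)

try:
    database = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
    stemmer = xapian.Stem("english")  # stemming rules for English
    for file in listdir(sys.argv[1]):
        if file[-4:] == '.txt':
            filename = sys.argv[1] + '/' + file
            try:
                fr = open(filename, 'r')
                content = fr.read()
                fr.close()
                content = content.strip()
                doc = xapian.Document()  # a new Document, roughly one record
                doc.set_data(content)
                doc.add_value(0, filename)  # store the file name in value slot 0
                doc.add_term(file[:-4])  # use the file name (minus .txt) as a term
                pos = 0
                terms = rex.findall(content)  # pull every word out of the text
                for term in terms:
                    if len(term) > MAX_TERM_LENGTH:
                        term = term[:MAX_TERM_LENGTH]
                    # Assumed step, missing from the listing: index the
                    # stemmed, lowercased term at position pos.
                    doc.add_posting(stemmer(term.lower()), pos)
                    pos += 1
                # Assumed step, missing from the listing: store the document.
                database.add_document(doc)
            except Exception:
                # Assumed handler for the inner try: skip files that fail.
                pass
except Exception as e:
    sys.stderr.write("Exception: %s\n" % str(e))
    sys.exit(1)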
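The tokenizing step in indexfiles.py can be tried on its own, without a Xapian installation. The sketch below reuses the same regular expression and the same 64-character limit, and pairs each word with its position in the text. The names TOKEN_RE and tokenize are hypothetical, introduced only for this illustration, and the sketch does no stemming.

```python
import re

# Same pattern as rex in indexfiles.py: runs of ASCII letters and digits.
TOKEN_RE = re.compile('[a-zA-Z0-9]+')


def tokenize(text, max_len=64):
    """Return (position, term) pairs for the words found in text."""
    pairs = []
    for pos, term in enumerate(TOKEN_RE.findall(text)):
        # Terms longer than max_len are cut down to max_len characters.
        pairs.append((pos, term[:max_len]))
    return pairs


if __name__ == "__main__":
    for pos, term in tokenize("Xapian is an Open Source Search Engine Library."):
        print(pos, term)
```

Because the pattern only matches letters and digits, punctuation acts as a separator, so a version string such as 0.9 is split into the two terms 0 and 9.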