Diff for "xapian004" - Woodpecker Wiki for CPUG

Differences between revisions 3 and 23 (spanning 20 versions)

First Steps with Xapian: hello xapian

Directory layout:

~/helloxapian

~/helloxapian/indexfiles.py

~/helloxapian/search.py

~/helloxapian/test

~/helloxapian/test/hello.txt

~/helloxapian/test/world.txt

~/helloxapian/test/abc.txt

hello.txt, world.txt, and abc.txt are three plain-text files, each containing a passage of English text.

indexfiles.py:
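
To try the tutorial without writing the three text files by hand, the layout above can be generated with a short script. This is only a sketch: the helper name `make_sample_tree` and the sample sentences are placeholders chosen for illustration, and any English text will do.

```python
import os
import tempfile

# Placeholder contents; any English passage works for the tutorial.
SAMPLE_FILES = {
    'hello.txt': 'Xapian is an Open Source Search Engine Library.',
    'world.txt': 'Hello world, this is placeholder text.',
    'abc.txt': 'A number of pieces of documentation are available.',
}


def make_sample_tree(root):
    """Create <root>/helloxapian/test with the three sample files."""
    test_dir = os.path.join(root, 'helloxapian', 'test')
    os.makedirs(test_dir, exist_ok=True)
    for name, text in SAMPLE_FILES.items():
        with open(os.path.join(test_dir, name), 'w') as fh:
            fh.write(text + '\n')
    return test_dir


if __name__ == '__main__':
    # Writes into a fresh temporary directory and prints where it ended up.
    print(make_sample_tree(tempfile.mkdtemp()))
```

Pass your home directory instead of a temporary one to get exactly the ~/helloxapian/test layout listed above.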

#!/usr/bin/env python
# coding=utf-8
import sys
import re
from os import listdir
from os.path import join

import xapian

# Simple tokenizer for English text; Chinese is not handled yet.
rex = re.compile('[a-zA-Z0-9]+')
MAX_TERM_LENGTH = 64  # maximum length of a single term
DBPATH = 'indexdb'    # directory that holds the index

if len(sys.argv) < 2:
    print("Missing argument: please provide the directory to index", file=sys.stderr)
    sys.exit(1)

try:
    database = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
    stemmer = xapian.Stem("english")  # stemming rules for English
    for file in listdir(sys.argv[1]):
        if file[-4:] == '.txt':
            filename = join(sys.argv[1], file)
            try:
                with open(filename, 'r') as fr:
                    content = fr.read().strip()
                doc = xapian.Document()  # a Document is roughly one record
                doc.set_data(content)
                doc.add_value(0, filename)  # value slot 0 stores the file path
                # Index the file name as a term. The original sliced file[-4:],
                # which is always ".txt"; file[:-4] is the name without it.
                doc.add_term(file[:-4])
                pos = 0
                terms = rex.findall(content)  # extract every word from the text
                for term in terms:
                    if len(term) > MAX_TERM_LENGTH:
                        term = term[:MAX_TERM_LENGTH]
                    # add the stemmed term and record its position
                    doc.add_posting(stemmer(term.lower()), pos)
                    pos += 1
                database.add_document(doc)
            except Exception as e:
                # skip files that cannot be read or indexed
                print("Skipping %s: %s" % (filename, e), file=sys.stderr)
except Exception as e:
    print("Exception: %s" % str(e), file=sys.stderr)
    sys.exit(1)
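
The word-splitting step in indexfiles.py can be tried on its own, without Xapian installed. The sketch below reuses the same regular expression and length limit; the function name `tokenize` is made up for this illustration, and stemming is left out on purpose because it needs the xapian module.

```python
import re

WORD_RE = re.compile('[a-zA-Z0-9]+')  # same pattern as rex in indexfiles.py
MAX_TERM_LENGTH = 64                  # same limit as in indexfiles.py


def tokenize(text, max_len=MAX_TERM_LENGTH):
    """Return (term, position) pairs: truncated, lowercased, numbered from 0."""
    postings = []
    for pos, word in enumerate(WORD_RE.findall(text)):
        postings.append((word[:max_len].lower(), pos))
    return postings


if __name__ == '__main__':
    for term, pos in tokenize("Xapian is an Open Source Search Engine Library"):
        print(pos, term)
```

Each pair corresponds to one `doc.add_posting(term, pos)` call in the indexer, except that the indexer also stems the term first.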

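Now open a terminal, change to ~/helloxapian, and run python indexfiles.py test

This builds an index for every .txt file under the test directory.

Afterwards a new indexdb directory appears under ~/helloxapian; this is where the index files are stored.

search.py: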

#!/usr/bin/env python
# coding=utf-8
import sys

import xapian

if len(sys.argv) < 2:
    print("Missing argument: please provide the keywords to search for", file=sys.stderr)
    sys.exit(1)

DBPATH = 'indexdb'
try:
    db = xapian.Database(DBPATH)  # open the index
    enquire = xapian.Enquire(db)  # Enquire is the class that runs queries
    stemmer = xapian.Stem('english')
    terms = []
    for term in sys.argv[1:]:
        # stem every keyword from the command line and add it to the term list
        terms.append(stemmer(term.lower()))
    # Query describes the search condition; OP_OR is how the terms are
    # combined (OP_AND and others are also available)
    query = xapian.Query(xapian.Query.OP_OR, terms)
    enquire.set_query(query)  # set the search condition
    mset = enquire.get_mset(0, 10)  # fetch the first ten results
    print('Estimated matches: ' + str(mset.get_matches_estimated()))
    for match in mset:
        doc = match.document  # the matching Document object
        # under Python 3 the bindings return bytes, so decode before printing
        name = doc.get_value(0).decode('utf-8')
        excerpt = doc.get_data().decode('utf-8')[:80]
        print('=========\nFile: %s\nExcerpt: %s...' % (name, excerpt))
except Exception as e:
    print("Exception: %s" % str(e), file=sys.stderr)
    sys.exit(1)

Run python search.py hello world (or any other keywords) to find the matching files.
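
The difference between OP_OR and OP_AND is easiest to see on a toy inverted index. The sketch below is plain Python and is not how Xapian works internally: the names `build_index`, `match_or`, and `match_and` and the three tiny documents are invented for the illustration. It only shows which documents each operator would select.

```python
def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index


def match_or(index, terms):
    """Documents containing at least one of the terms (like OP_OR)."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def match_and(index, terms):
    """Documents containing every one of the terms (like OP_AND)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


docs = {
    'hello.txt': 'hello xapian',
    'world.txt': 'hello world',
    'abc.txt': 'abc',
}
index = build_index(docs)
print(sorted(match_or(index, ['hello', 'world'])))   # ['hello.txt', 'world.txt']
print(sorted(match_and(index, ['hello', 'world'])))  # ['world.txt']
```

Unlike this toy, Xapian also ranks the matches by weight, so with OP_OR a document containing both terms would normally be listed ahead of one containing only a single term.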

头太晕

- Revision 3 as of 2007-01-26 15:36:57
  Size: 2336
  Editor: wangzhen
  Comment:
+ Revision 23 as of 2009-12-25 07:15:29
  Size: 3387
  Editor: localhost
  Comment: converted to 1.6 markup
-Deletions are marked like this.
+Additions are marked like this.
 Line 5:
- ~/helloxapian
+~/helloxapian
 Line 19:
+hello.txt, world.txt, and abc.txt are three plain-text files, each containing a passage of English text.
-Line 20:
+Line 21:
+indexfiles.py:
-Line 21:
+Line 23:
- Contents of hello.txt:
+#!/usr/bin/env python
-Line 23:
+Line 25:
- Welcome to the Xapian project website.
Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, and Ruby (so far!)
Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.
If you're after a packaged search engine for your website, you should take a look at Omega: an application we supply built upon Xapian. Unlike most other website search solutions, Xapian's versatility allows you to extend Omega to meet your needs as they grow.
+#coding=utf-8
-Line 28:
+Line 27:
- Contents of world.txt:
+import sys
-Line 30:
+Line 29:
- The 0.9 branch features a few API changes, the most notable being a rewritten QueryParser which is reentrant, has encapsulated internals, and parses better than the old one. Note that the examples are now a subdirectory of xapian-core, so there is no longer a separate xapian-examples download (most of the size of the xapian-examples download was due to configure and other generated files!)
+import xapian
-Line 32:
+Line 31:
-Contents of abc.txt:
+import string
-Line 34:
+Line 33:
-Documentation
A number of pieces of documentation are available.
+from os import listdir
-Line 37:
+Line 35:
-We suggest you start by reading the Installation Guide, which covers downloading the code, and unpacking, configuring, building and installing. It then shows how to build the example programs.
+import re
-Line 39:
+Line 37:
-For a quick introduction to our software, including a walk-through example of an application for searching through some data, read the Quickstart document.
+rex=re.compile('[a-zA-Z0-9]+') # simple tokenizer for English text; Chinese is not handled yet
-Line 41:
+Line 39:
-The Overview explains the API which Xapian provides to programmers.
+MAX_TERM_LENGTH = 64 # maximum length of a single term
-Line 43:
+Line 41:
-Much useful documentation is automatically extracted from the source code. Full documentation of the API is available for users. For those wishing to do development work on the Xapian library itself, documentation of the internals is available, and there's a short document outlining the directory structure which is automatically generated from the source code.
+DBPATH='indexdb' # directory that holds the index

[[torry|头太晕]]