A first look at Xapian: hello xapian
Directory structure:
~/helloxapian
~/helloxapian/indexfiles.py
~/helloxapian/search.py
~/helloxapian/test
~/helloxapian/test/hello.txt
~/helloxapian/test/world.txt
~/helloxapian/test/abc.txt
Contents of hello.txt:
Welcome to the Xapian project website.Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, and Ruby (so far!) Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators. If you're after a packaged search engine for your website, you should take a look at Omega: an application we supply built upon Xapian. Unlike most other website search solutions, Xapian's versatility allows you to extend Omega to meet your needs as they grow.
Contents of world.txt:
The 0.9 branch features a few API changes, the most notable being a rewritten QueryParser which is reentrant, has encapsulated internals, and parses better than the old one. Note that the examples are now a subdirectory of xapian-core, so there is no longer a separate xapian-examples download (most of the size of the xapian-examples download was due to configure and other generated files!)
Contents of abc.txt:
Documentation A number of pieces of documentation are available.We suggest you start by reading the Installation Guide, which covers downloading the code, and unpacking, configuring, building and installing. It then shows how to build the example programs.For a quick introduction to our software, including a walk-through example of an application for searching through some data, read the Quickstart document.The Overview explains the API which Xapian provides to programmers.Much useful documentation is automatically extracted from the source code. Full documentation of the API is available for users. For those wishing to do development work on the Xapian library itself, documentation of the internals is available, and there's a short document outlining the directory structure which is automatically generated from the source code.
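indexfiles.py:
    #!/usr/bin/env python3
    # Index every .txt file in the given directory into a Xapian database.
    import os
    import re
    import sys

    import xapian

    rex = re.compile('[a-zA-Z0-9]+')   # a term is a run of ASCII letters and digits
    MAX_TERM_LENGTH = 64               # longer terms are truncated to this length
    DBPATH = 'indexdb'                 # database directory, created on first run

    if len(sys.argv) < 2:
        print("Missing argument: please provide the directory to index",
              file=sys.stderr)
        sys.exit(1)

    try:
        database = xapian.WritableDatabase(DBPATH, xapian.DB_CREATE_OR_OPEN)
        stemmer = xapian.Stem("english")
        for name in os.listdir(sys.argv[1]):
            if not name.endswith('.txt'):
                continue
            filename = os.path.join(sys.argv[1], name)
            try:
                with open(filename, 'r') as fr:
                    content = fr.read().strip()
                doc = xapian.Document()
                doc.set_data(content)        # stored text, available from search results
                doc.add_value(0, filename)   # value slot 0 holds the file path
                doc.add_term(name[:-4])      # file name without ".txt" as an extra term
                # Add each lowercased, stemmed word together with its position.
                for pos, term in enumerate(rex.findall(content)):
                    term = term[:MAX_TERM_LENGTH]
                    doc.add_posting(stemmer(term.lower()), pos)
                database.add_document(doc)
            except Exception as e:
                # Skip files that cannot be read or indexed, but report them.
                print("Skipping %s: %s" % (filename, e), file=sys.stderr)
    except Exception as e:
        print("Exception: %s" % e, file=sys.stderr)
        sys.exit(1)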
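To see which terms and positions the indexing loop in indexfiles.py hands to Xapian, the tokenizing step can be tried on its own. The following is only a sketch: it reuses the regular expression and length limit from the script, but it leaves out stemming so that it runs without Xapian installed, and the helper name extract_postings is made up for this illustration.

```python
import re

# Same pattern and limit as in indexfiles.py.
TOKEN_RE = re.compile('[a-zA-Z0-9]+')
MAX_TERM_LENGTH = 64


def extract_postings(text):
    """Return (term, position) pairs: lowercased alphanumeric runs,
    truncated to MAX_TERM_LENGTH. Stemming is deliberately omitted."""
    postings = []
    for pos, token in enumerate(TOKEN_RE.findall(text)):
        postings.append((token[:MAX_TERM_LENGTH].lower(), pos))
    return postings


if __name__ == "__main__":
    sample = "Xapian's versatility allows you to extend Omega."
    for term, pos in extract_postings(sample):
        print(pos, term)
```

Note that the apostrophe in "Xapian's" splits the word into two terms, "xapian" and "s", because the pattern only matches ASCII letters and digits. In the real script each term additionally passes through the English stemmer before it is added to the document.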