Uninstall swish
To minimize downtime, create new index files before running "make install" by using Swish-e from the build directory. Then copy the index files to the live location and run "make install".
Here's another installation example. This example also shows building Swish-e in a "build" directory that is separate from where the source files are located. Without GNU Make, you will likely need to build from within the source directory, as shown in the previous example. Swish-e uses a configure script. Two options are of common interest: --prefix sets the top-level installation directory; --disable-shared will link Swish-e statically, which may be needed on some platforms (e.g. Solaris 2). Platforms may require varying link instructions when libraries are installed in non-standard locations.
Swish-e uses the GNU autoconf tools for building the package. This may mean reading the manual page for your compiler and linker to see how to specify non-standard file locations.
In this example, we do not have root access. This is required because libxml2 installs a program that is used when running the configure script. Before running configure, type:. The -R option says to add a specified path or paths to those that are used to find shared libraries at run time.
These paths are stored in the Swish-e binary. When Swish-e is run, it will look in these directories for shared libraries. Some platforms may not support the -R option. Swish installs a number of files. Individual paths may also be set. Run configure --help for details. Note that the Perl modules are not installed in the system Perl library. Swish-e and the Perl scripts that require the modules know where to find the modules, but the perldoc program used for reading documentation does not.
Running "make install" installs some of the Swish-e documentation as man pages. The following man pages are installed:. The man pages are installed, by default, in the system man directory. This directory is determined when configure is run; it can be set by passing a directory name to configure. The man directory is specified relative to the --prefix setting. If you use --prefix, you do not normally need to also specify --mandir.
The Swish-e discussion list is the place to ask questions about installing and using Swish-e, see or post bug fixes or security announcements, and offer help to others. Please do not contact the developers directly. The list is typically very low traffic, so it won't overload your inbox. Please take the time to subscribe. If you are using Swish-e on a public site, please let the list know, so that your URL can be added to the list of sites that use Swish-e! Support for installation, configuration, and usage is available via the Swish-e discussion list.
Do not contact developers directly for help -- always post your question to the list. Please search the Swish-e list archive before posting a question. Swish-e has several switches e. First, try it without any changes to default settings:. If it still isn't working as you expect, try to reduce the test document to a very small example.
This will be very helpful to your readers, when you are asking for help. Another useful trick is to use -H9 when searching, to display full headers in search results. Look at the "Parsed Words" header to see what words Swish-e is searching for.
Use these guidelines when asking for help. The most important tip is to provide the least amount of information that can be used to reproduce your problem. Do not paraphrase output -- copy-and-paste -- but trim text that is not necessary.
The exact version of Swish-e that you are using. Running Swish-e with the -V switch will print the version number. Also, supply the output from uname -a or similar command that identifies the operating system you are running on. If you are running an old version of swish, be prepared for a response of "upgrade" to your question. A summary of the problem. This should include the commands issued e. Please copy-and-paste the exact commands and their output, instead of retyping, to avoid errors.
Include a copy of the configuration file you are using, if any. Swish-e has reasonable defaults, so in many cases you can run it without using a configuration file. But, if you need to use a configuration file, reduce it down to the absolute minimum number of commands that is required to demonstrate your problem.
Again, copy-and-paste. If you are having problems spidering a web server, use lwp-download or wget to copy the file locally, then make sure you can index the document using the file system method. This will help you determine if the problem is with spidering or indexing. If you expect help with spidering, don't post fake URLs, as it makes it impossible to test. If you don't want to expose your web page to the people on the Swish-e list, find some other site to test spidering on.
If that works, but you still cannot spider your own site, you may need to request help from others. If so, you must post your real URL or make a test document available via some other source.
If you are having trouble building Swish-e, please copy-and-paste the output from make or from. The Swish-e distribution includes a module that provides a Perl interface to the Swish-e C library.
This module provides a way to search a Swish-e index without running the Swish-e program. This section should give you a basic overview of indexing and searching with Swish-e. Other examples can be found in the conf directory; these will step you through a number of different configurations. Swish-e is a command-line program. The program is controlled by passing switches on the command line. A configuration file may be used, but often is not required.
Swish-e does not include a graphical user interface. Example CGI scripts are provided in the distribution, but they require additional setup to use. You may specify one or more files or directories with the -i option. By default, this will create an index called index. As mentioned above, Swish-e will index all files in a directory, unless instructed otherwise. Create a file called swish. The order of statements in the configuration file is typically not important, although some statements depend on previously set statements.
There are many possible settings. Good advice is to use as few settings as possible when first starting out with Swish-e. You may also see a summary of options by running:. Swish-e has two other methods for reading input files. This will spider the web server running on the local host.
The -S option defines the input source method to be "http", -i specifies the URL to spider, and -v sets the verbose level to two. There are a number of configuration options that are specific to the -S http input source. The -S http method is deprecated, however, in favor of a variation on the following input method. There is a general-purpose input method wherein Swish-e reads input from a program that produces documents in a special format.
The program might read and format data stored in a database, or parse and format messages in a mailing list archive, or run a program that spiders web sites like the previous method. The Swish-e distribution includes a spider program that uses this method of input.
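As a rough illustration of the kind of program the -S prog method can read from, here is a minimal Python sketch that writes documents to an output stream. It assumes the header names Path-Name, Content-Length, Last-Mtime and Document-Type, and the "headers, blank line, body" layout, as described in the Swish-e -S prog documentation; check the documentation for your version before relying on them. The function names and the sample values are hypothetical.

```python
def format_document(path, body, mtime=None, doc_type=None):
    """Return one document as bytes: header lines, a blank line, then the body.

    The header names are an assumption based on the Swish-e -S prog docs.
    """
    # Swish-e indexes 8-bit characters, so encode to Latin-1 before counting bytes.
    data = body.encode("latin-1", errors="replace")
    headers = [f"Path-Name: {path}", f"Content-Length: {len(data)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    if doc_type is not None:
        headers.append(f"Document-Type: {doc_type}")
    head = "\n".join(headers) + "\n\n"
    return head.encode("latin-1") + data


def write_documents(docs, out):
    """Write (path, body) pairs to a binary stream, one document after another."""
    for path, body in docs:
        out.write(format_document(path, body))
```

A script run by Swish-e would pass `sys.stdout.buffer` as the stream, so that the Content-Length byte count matches exactly what is written.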
This spider program is much more configurable and feature-rich than the previous -S http method. This says to use the -S prog input source method. Note that, in this case, the IndexDir setting does not specify a file or directory to index, but a program name to be run.
This program, spider. The SwishProgParameters option is a special feature that allows passing command-line parameters to the program specified with IndexDir. In this case, we are passing the word default which tells spider. Running a script under Windows requires specifying the interpreter e.
The advantage of the -S prog method of spidering over the previous -S http method is that the Perl code is only compiled once instead of once for every document fetched from the web server. In addition, it is a much more advanced spider with many, many features. Still, as used here, spider. This allows running Swish-e from a program instead of running the external program from Swish-e.
So, this also can be done as:. One final note about the -S prog input source method. The program specified with -i or IndexDir needs to be an absolute path. The exception is when the program is installed in the libexecdir directory. Then, a plain program name may be specified as in the example showing spider. Swish-e creates a reverse i.
Just like an index in a book, you look up a word and it lists the pages or documents where that word can be found. Swish-e can create multiple index tables within the same index file. For example, you might want to create an index that only contains words in HTML titles, so that searches can be limited to title text. So instead of checking for specific types of content you just pass the content type and the document to the SWISH::Filter module and it returns a new content type and document if it was filtered.
The filters that do the actual work are designed with a standard interface and work like filter "plug-ins". Adding new filters means just downloading the filter to a directory; no changes are needed to the spider's configuration file.
Download a filter for PostScript and the next time you run indexing your PostScript files will be indexed. Since the filters are standardized, hopefully when you have the need to filter documents of a specific type there will already be a filter ready for your use.
Now, note that the Perl modules may or may not do the actual conversion of a document. For example, the PDF conversion module calls the pdfinfo and pdftotext programs. Those programs (part of the Xpdf package) must be installed separately from the filters. The SwishSpiderConfig. By default the swishspider program (the Perl helper script that fetches documents from the web) will attempt to use the SWISH::Filter module if it can be found in Perl's library path. This path is set automatically for spider.
Therefore, all that's required to use this system with -S http is setting the @INC array to point to the filter directory. Of course, if you are concerned with indexing speed you should be using the -S prog method with spider. Here are two examples of how to run a filter program, one using Swish-e's FileFilter directive, another using a prog input method program.
See the SwishSpiderConfig. First, using the FileFilter method, here's the entire configuration file swish. Now, the same thing using the -S prog document source input method and a Perl program called catfilter. You can see that it's much more work than using the FileFilter method above, but it provides a place to do additional processing.
In this example, the prog method is only slightly faster. But if you needed a Perl script to run as a FileFilter then prog will be significantly faster. This example will probably not work under Windows due to the '-|' open. A simple piped open may work just as well:.
Perl will try to avoid running the command through the shell if meta characters are not passed to the open. See perldoc -f open for more information. See the examples in the conf directory and the comments in the SwishSpiderConfig. See the previous question for the details on filtering.
The method you decide to use will depend on how fast you want to index, and your comfort level with using Perl modules. Both the -S prog input method and filters use the popen system call to run the external program.
If your external program is, for example, a Perl script, you have to tell Swish-e to run perl, instead of the script. Swish-e will convert forward slashes to backslashes when running under Windows. For example, you would need to specify the path to perl as (assuming this is where perl is on your system):. Swish-e indexes 8-bit characters only. As long as they are listed in WordCharacters they will be indexed.
Actually, you probably can index any 8-bit character set, as long as you don't mix character sets in the same index and don't use libxml2 for parsing (see below). You may specify the mapping of one character to another character with the TranslateCharacters directive. TranslateCharacters :ascii7: is a predefined set of characters that will translate eight-bit characters to ascii7 characters. Note: When using libxml2 for parsing, parsed documents are converted internally within libxml2 to UTF-8. This is converted to ISO Latin-1 when indexing.
This will result in some words being indexed incorrectly. Swish-e currently has no way to add or remove items from its index. But, Swish-e indexes so quickly that it's often possible to reindex the entire document set when a file needs to be added, modified or removed.
If you are spidering a remote site then consider caching documents locally, compressed. Incremental additions can be handled in a couple of ways, depending on your situation. It's probably easiest to create one main index every night (or every week), and then create an index of just the new files between main indexing jobs and use the -f option to pass both indexes to Swish-e while searching.
You can merge the indexes into one index instead of using -f, but it's not clear that this has any advantage over searching multiple indexes. One method of creating the incremental index is to use the -N switch to pass a file path to Swish-e when indexing. It will only index files that have a last modification date newer than the file supplied with the -N switch. This option has the disadvantage that Swish-e must process every file in every directory as if they were going to be indexed (the test for -N is done last, right before indexing of the file contents begins and after all other tests on the file have been completed), all that just to find a few new files.
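The "newer than a reference file" test can be pictured with a short Python sketch. This is only an illustration of the comparison, not how Swish-e implements -N; the function name is hypothetical.

```python
import os


def files_newer_than(root, reference_path):
    """Yield paths under root whose modification time is later than reference_path's."""
    cutoff = os.path.getmtime(reference_path)
    # Every file still has to be visited and checked, even if only a few are new.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > cutoff:
                yield path
```

Note that the walk touches every file in the tree, which mirrors the disadvantage described above.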
Also, if you use the Swish-e index file as the file passed to -N there may be files that were added after indexing was started, but before the index file was written. This could result in a file not being added to the index. Another option is to maintain a parallel directory tree that contains symlinks pointing to the main files. When a file is added to or changed in the main directory tree, you create a symlink to the real file in the parallel directory tree.
Then just index the symlink directory to generate the incremental index. This option has the disadvantage that you need to have a central program that creates the new files that can also create the symlinks. But, indexing is quite fast since Swish-e only has to look at the files that need to be indexed. When you run full indexing you simply unlink (delete) all the symlinks. Both of these methods have issues where files could end up in both indexes, or be left out of an index.
Use of file locks while indexing, and hash lookups during searches can help prevent these problems. It's true that indexing can take up a lot of memory! Swish-e is extremely fast at indexing, but that comes at the cost of memory. Another option is to use the -e switch. This will require less memory, but indexing will take longer as not all data will be stored in memory while indexing. How much less memory and how much more time depends on the documents you are indexing, and the hardware that you are using.
Here's an example of indexing all. This first example is without -e and used about 84M of memory:. You can also build a number of smaller indexes and then merge them together with -M. Using -e while merging will save memory. Finally, if you do build a number of smaller indexes, you can specify more than one index when searching by using the -f switch.
Sorting large result sets by a property will be slower when specifying multiple index files while searching. Some platforms report "too many open files" when using the -e economy option. The -e feature uses many temporary files, plus the index files, and this may exceed your system's limits. But, there are two things you can try:. The -e option will run Swish-e in economy mode, which uses the disk to store data while indexing.
This makes Swish-e run somewhat slower, but also uses less memory. If you are concerned about search time, make sure you are using the -b and -m switches to return only a page at a time. If you know that your result sets will be large, that you wish to return results one page at a time, and that many pages of the same query will often be requested, it may be smart to request all the documents on the first request and then cache the results to a temporary file.
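The caching idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions: run_search is a hypothetical callable that returns the complete result list for a query (for example, by running swish-e without a result limit), and the cache file naming is invented for the example.

```python
import hashlib
import json
import os
import tempfile


def cached_page(query, begin, page_size, run_search, cache_dir=None):
    """Return one page of results, running the full search only on a cache miss."""
    cache_dir = cache_dir or tempfile.gettempdir()
    # Hash the query so that it can safely be used as part of a file name.
    key = hashlib.sha1(query.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, f"swish-results-{key}.json")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as fh:
            results = json.load(fh)
    else:
        results = run_search(query)  # hypothetical: returns a list of results
        with open(cache_file, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
    return results[begin:begin + page_size]
```

The sketch never expires its cache files, so a real script would need to discard them whenever the index is rebuilt.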
The Perl module File::Cache makes this very simple to accomplish. If possible, use the file system method (-S fs) of indexing to index documents in your web area of the file system.
This avoids the overhead of spidering a web server and is much faster. If this is impossible (the web server is not local, or documents are dynamically generated), Swish-e provides two methods of spidering.
First, it includes the http method of indexing (-S http). A Perl helper script (swishspider) is included in the src directory to assist with spidering web servers.
There are example configurations for spidering in the conf directory. As of Swish-e 2. A number of example programs can be found in the prog-bin directory, including a program to spider web servers. The provided spider. The advantage of the "prog" document source feature over the "http" method is that the program is only executed one time, where the swishspider. The forking of Swish-e and compiling of the perl script can be quite expensive, time-wise.
The other advantage of the spider. And since it's a perl program there's no limit on the features you can add. Does the file swishspider exist where the error message displays?
If not, either set the configuration option SpiderDirectory to point to the directory where the swishspider program is found, or place the swishspider program in the current directory when running swish-e. If you are running Windows, make sure "perl" is in your path.
Try typing perl from a command prompt. If you are not running Windows, make sure that the shebang line (the first line of the swishspider program, which starts with #!) points to perl on your system. Also, make sure that you have execute and read permissions on swishspider. The spider. See perldoc spider. Swish cannot follow links generated by Javascript, as they are generated by the browser and are not part of the document. You can either merge (-M) two indexes into a single index, or use -f to specify more than one index while searching.
So that will only find documents with the word "foo" and where the file's path contains "sales". That might not work as well as you like, though, as both of these paths will match:.
The second option is a bit more powerful. With the ExtractPath directive you can use a regular expression to extract out a sub-set of the path and save it as a separate meta name:.
And that gets indexed as meta name "department". Note that you can map completely different areas of your file system to the same metaname:. Finally, if you have something more complicated, use -S prog and write a perl program or use a filter to set a meta tag when processing each file.
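To make the difference between a plain substring match and an anchored path extraction concrete, here is a small Python sketch. It is not Swish-e configuration syntax; the directory layouts, the patterns, and the function name are hypothetical, chosen only to show two different areas of a file system mapping to one "department" value.

```python
import re

# Hypothetical layouts: /web/<department>/... and /archive/<year>/<department>/...
DEPARTMENT_RULES = [
    re.compile(r"^/web/([^/]+)/"),
    re.compile(r"^/archive/\d{4}/([^/]+)/"),
]


def extract_department(path):
    """Return the department captured from the path, or None if no rule matches."""
    for rule in DEPARTMENT_RULES:
        match = rule.match(path)
        if match:
            return match.group(1)
    return None
```

With these rules, a file such as /web/marketing/sales-figures.html contains the text "sales" in its path, yet the extracted department is "marketing", which is the kind of precision a substring test cannot give.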
The swishrank property value is calculated based on which Ranking Scheme or algorithm you have selected. In this discussion, any time the word fancy is used, you should consult the actual code for more details. It is open source, after all. You may configure your index to bias certain metaname values more or less than others. Set to 1 (default) or 0 in your config file. Each term's position in each HTML document is given a structure value based on the context in which the word appears.
The structure value is used to artificially inflate the frequency of each term in that particular document. These structural values are defined in config. For example, if the word foo appears in the title of a document, the Scheme will treat that document as if foo appeared 7 additional times. The rank value is averaged for all AND'd terms. Terms within a set of parentheses are averaged as a single term (this is an acknowledged weakness and is on the TODO list).
The rank value is summed and then doubled for each pair of OR'd terms. This results in higher ranks for documents that have multiple OR'd terms. After a document's raw rank score is calculated, a final rank score is calculated using a fancy log function.
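The AND and OR rules described above can be sketched in a few lines of Python. This only restates the two rules as arithmetic; the function names are hypothetical, and the real code may round or use integer arithmetic differently.

```python
def rank_and(*ranks):
    """AND'd terms: the rank values are averaged."""
    return sum(ranks) / len(ranks)


def rank_or(left, right):
    """A pair of OR'd terms: the rank values are summed, then doubled."""
    return (left + right) * 2
```

For two terms with ranks 8 and 4, the AND rule gives 6 while the OR rule gives 24, which shows why documents matching several OR'd terms rank higher.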
All the documents are then scaled against a fixed base score, so the top-ranked document will always have the same swishrank value. Here is a brief overview of how the different Schemes work. The number in parentheses after the name is the value to invoke that scheme with swish-e -R or RankScheme. The default ranking scheme considers the number of times a term appears in a document (frequency), the MetaNamesRank and the structure value.
The rank might be summarized as:. Every word instance starts with a base score of 1. Then for each instance of your word, a running sum is taken of the structural value of that word position plus any bias you've configured. That means there was one instance of our word in the title of the file. The base rank of 1 plus the structure score of 7 equals 8.
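The arithmetic in that example can be written out as a short Python sketch. It assumes the base score of 1 is applied once per word and that the structure value plus bias is then added for each instance; the text can be read either way, and both readings give 8 for a single instance in the title. The function name is hypothetical, and only the title value of 7 comes from the text.

```python
TITLE_STRUCTURE_VALUE = 7  # the title value quoted in the text


def raw_word_rank(structure_values, bias=0):
    """Raw rank for one word in one document.

    structure_values holds one structure value per instance of the word.
    Assumption: the base score of 1 is counted once, not once per instance.
    """
    rank = 1
    for value in structure_values:
        rank += value + bias
    return rank
```

Calling `raw_word_rank([TITLE_STRUCTURE_VALUE])` reproduces the raw rank of 8 from the example above.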
That's fancy ranking lingo for taking into account the total frequency of a term across the entire index, in addition to the term's frequency in a single document. IDF ranking also uses the relative density of a word in a document to judge its relevancy.