Making Apache Solr Search Site Content
In order to make Apache Solr search site content, we need to index the whole site. To index the whole site, we need the web crawler Apache Nutch, which crawls the site and produces the data Solr indexes. Apache Nutch is easily configurable with Apache Solr.
Steps to Install Apache Nutch:
(1) Download a binary package (apache-nutch-1.x-bin.zip) from the Apache Nutch download page.
(2) Unzip it on a local drive.
(3) Go to the bin folder inside the extracted apache-nutch-1.x folder (e.g. apache-nutch-1.x/bin).
(4) Run "./nutch" (make sure the JAVA_HOME environment variable is set properly).
(5) You can confirm a correct installation if you see the following:
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
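Step (4) assumes JAVA_HOME points at a JDK. A minimal sketch, assuming a typical Linux JDK location (the path is an example, not a guaranteed one; adjust it to your system):

```shell
# Set JAVA_HOME if it is not already set (the path below is an assumption),
# so that the bin/nutch script can locate Java.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/default-java}"
echo "Using JAVA_HOME=$JAVA_HOME"
# ./nutch    # run from apache-nutch-1.x/bin once JAVA_HOME is a real JDK
```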
(6) Now edit the file conf/nutch-site.xml with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>
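The edit above can also be scripted. A sketch, assuming it is run from the Nutch root folder (note that it overwrites conf/nutch-site.xml, so back the file up first if needed):

```shell
# Write a minimal nutch-site.xml override containing only the crawler's
# user-agent name (Nutch requires http.agent.name before it will fetch).
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>
EOF
```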
(7) Create a new directory named urls in the apache-nutch-1.x root folder.
(8) Create a file named seeds.txt inside the urls directory, with one URL per line for each site you want Nutch to crawl.
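Creating the seed file can be done in one line from the Nutch root folder; the URL below is just the example domain used later in this article:

```shell
# Create the urls directory and a seed file with one URL per line.
mkdir -p urls
echo "http://www.surekhatech.com/" > urls/seeds.txt
cat urls/seeds.txt
```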
(9) Now edit the file conf/regex-urlfilter.txt and replace the following lines:
# accept anything else
+.
with a regular expression matching the domain you wish to crawl.
For example, if you wish to limit the crawl to the surekhatech.com domain, the line should be like below:
+^http://([a-z0-9]*\.)*surekhatech.com/
This will include any URL in the domain surekhatech.com.
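You can sanity-check the pattern outside Nutch with grep -E (the leading + in regex-urlfilter.txt is Nutch's include marker, not part of the regular expression itself):

```shell
# The regex from regex-urlfilter.txt, without the leading '+' include marker.
pattern='^http://([a-z0-9]*\.)*surekhatech.com/'
echo "http://www.surekhatech.com/blog" | grep -Eq "$pattern" && echo "match"
echo "http://example.com/page" | grep -Eq "$pattern" || echo "no match"
```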
(10) Now we are ready to initiate a crawl. The crawl command takes the following parameters:
- -dir dir names the directory to put the crawl in.
- -threads threads determines the number of threads that will fetch in parallel.
- -depth depth indicates the link depth from the root page that should be crawled.
- -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
- Run the following command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
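For reference, the one-step crawl command (deprecated, as the usage listing above notes) wraps an inject/generate/fetch/parse/updatedb cycle. A dry-run sketch of those rounds, where the segment path is a placeholder and the commands are only printed, not executed:

```shell
# Dry-run helper: prints each Nutch command instead of executing it.
run() { echo "+ $*"; }

run bin/nutch inject crawl/crawldb urls          # seed the crawl db once
for depth in 1 2 3; do                           # one round per -depth level
  run bin/nutch generate crawl/crawldb crawl/segments -topN 5
  seg='crawl/segments/<latest>'                  # placeholder: newest segment dir
  run bin/nutch fetch "$seg"
  run bin/nutch parse "$seg"
  run bin/nutch updatedb crawl/crawldb "$seg"
done
```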
(11) Integrate Solr with Nutch
Now we have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s).
Below are the steps to delegate searching to Solr so that the crawled links become searchable:
We need to copy the schema.xml file (schema-solr4.xml for Solr 4.3.0) from Nutch to Solr as below:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example.
Run the Solr Index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
If you want to see the raw HTML indexed by Solr, change the content field definition in schema.xml to:
<field name="content" type="text" stored="true" indexed="true"/>
(12) Now open http://localhost:8983/solr/collection1/browse and search for any content; you will get search results from the site's contents.
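The same search can also be issued over HTTP. A hedged example assuming the default port and core name; since it requires a running Solr instance, the curl line is left commented out:

```shell
# Build a Solr select URL for a keyword query against the content field.
q='content:nutch'
url="http://localhost:8983/solr/collection1/select?q=${q}&wt=json"
echo "$url"
# curl "$url"    # uncomment with Solr running to get JSON results
```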