Making Apache Solr Search Site Content
In order to make Apache Solr search site content, we need to index the whole site. To index the whole site, we need the web crawler Apache Nutch, which crawls the site and produces the data Solr indexes. Apache Nutch is easily configurable with Apache Solr.
Steps to Install Apache Nutch:
(1) Download a binary package (apache-nutch-1.x-bin.zip) from the Apache Nutch download page.
(2) Unzip it on a local drive.
(3) Go to the bin folder inside the extracted apache-nutch-1.x folder (e.g. apache-nutch-1.x/bin).
(4) Run "./nutch" (make sure the JAVA_HOME environment variable is set properly).
(5) You can confirm a correct installation if you see the following:
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
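Step (4) assumes JAVA_HOME points at a JDK. A minimal sketch, assuming a typical Linux JDK location (the path is an example, not a guaranteed one; adjust it to your system):

```shell
# Set JAVA_HOME if it is not already set (the path below is an assumption),
# so that the bin/nutch script can locate Java.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/default-java}"
echo "Using JAVA_HOME=$JAVA_HOME"
# ./nutch    # run from apache-nutch-1.x/bin once JAVA_HOME is a real JDK
```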
(6) Now edit the file conf/nutch-site.xml with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>
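The edit above can also be scripted. A sketch, assuming it is run from the Nutch root folder (note that it overwrites conf/nutch-site.xml, so back the file up first if needed):

```shell
# Write a minimal nutch-site.xml override containing only the crawler's
# user-agent name (Nutch requires http.agent.name before it will fetch).
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>
EOF
```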
(7) Create a new directory named urls in the apache-nutch-1.x root folder.
(8) Create a file named seeds.txt inside the urls directory, with one URL per line for each site you want Nutch to crawl.
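Creating the seed file can be done in one line from the Nutch root folder; the URL below is just the example domain used later in this article:

```shell
# Create the urls directory and a seed file with one URL per line.
mkdir -p urls
echo "http://www.surekhatech.com/" > urls/seeds.txt
cat urls/seeds.txt
```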
(9) Now edit the file conf/regex-urlfilter.txt and replace the following lines:
# accept anything else
+.
with a regular expression matching the domain you wish to crawl.
For example, if you wish to limit the crawl to the surekhatech.com domain, the line should be like below:
+^http://([a-z0-9]*\.)*surekhatech.com/
This will include any URL in the domain surekhatech.com.
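You can sanity-check the pattern outside Nutch with grep -E (the leading + in regex-urlfilter.txt is Nutch's include marker, not part of the regular expression itself):

```shell
# The regex from regex-urlfilter.txt, without the leading '+' include marker.
pattern='^http://([a-z0-9]*\.)*surekhatech.com/'
echo "http://www.surekhatech.com/blog" | grep -Eq "$pattern" && echo "match"
echo "http://example.com/page" | grep -Eq "$pattern" || echo "no match"
```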
(10) Now we are ready to initiate a crawl. The crawl command takes the following parameters:
- -dir dir names the directory to put the crawl in.
- -threads threads determines the number of threads that will fetch in parallel.
- -depth depth indicates the link depth from the root page that should be crawled.
- -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
- Run the following command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
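For reference, the one-step crawl command (deprecated, as the usage listing above notes) wraps an inject/generate/fetch/parse/updatedb cycle. A dry-run sketch of those rounds, where the segment path is a placeholder and the commands are only printed, not executed:

```shell
# Dry-run helper: prints each Nutch command instead of executing it.
run() { echo "+ $*"; }

run bin/nutch inject crawl/crawldb urls          # seed the crawl db once
for depth in 1 2 3; do                           # one round per -depth level
  run bin/nutch generate crawl/crawldb crawl/segments -topN 5
  seg='crawl/segments/<latest>'                  # placeholder: newest segment dir
  run bin/nutch fetch "$seg"
  run bin/nutch parse "$seg"
  run bin/nutch updatedb crawl/crawldb "$seg"
done
```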
(11) Integrate Solr with Nutch
Now we have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s).
Below are the steps to delegate searching to Solr so that the crawled links become searchable:
We need to copy the schema.xml file (schema-solr4.xml for Solr 4.3.0) from Nutch to Solr as below:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example.
Run the Solr Index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
If you want to see the raw HTML indexed by Solr, change the content field definition in schema.xml to:
<field name="content" type="text" stored="true" indexed="true"/>
(12) Now open http://localhost:8983/solr/collection1/browse and search for any content; you will get search results from the site's contents.
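The same search can also be issued over HTTP. A hedged example assuming the default port and core name; since it requires a running Solr instance, the curl line is left commented out:

```shell
# Build a Solr select URL for a keyword query against the content field.
q='content:nutch'
url="http://localhost:8983/solr/collection1/select?q=${q}&wt=json"
echo "$url"
# curl "$url"    # uncomment with Solr running to get JSON results
```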