It's probably hard to find a comparison between Apache Lucene and the Google Search Appliance because they're such different things. While Lucene is a software component for indexing documents with basic relevance "boosting" built in, the GSA is an enterprise search product (an appliance, i.e. physical hardware) with lots of out-of-the-box functionality to tune and optimize search results, based on the Google search algorithm.
So they are basically two great tools for different implementation scenarios. Of course they overlap, especially when used to provide search on your average website.
Off the top of my head, here are a few topics you might want to start with for a comparison:
Deployment/Architecture
- Lucene is a software component that can be deeply integrated into your own software, providing an index (usually file based, sometimes in memory) to index and retrieve content quickly
- The Lucene project provides quite a large list of analyzers for proper indexing of different languages (Western languages, Arabic, Asian languages etc.), though the analyzers still have room for improvement
- Lucene.Net is quite a popular port for integration on Microsoft .NET platforms
- GSA is software and hardware bundled together and sold as an appliance with an HTTP(S) interface that returns the search results as either HTML (through its own XSLTs) or XML (for better integration into your website)
- GSA comes with language bundles (installed and downloadable), and you have to choose one of them. If the languages you need are not all in a single bundle, you might need to add another GSA to the infrastructure
- GSA performs excellently and requires very little maintenance
- GSA lets you scale with almost no engineering effort: globally distributed but connected GSAs can be set up through the web interface
- GSA can be made highly available (HA) by purchasing a cheaper hot-backup module
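To give an idea of what integrating the GSA's XML output into your own site looks like, here is a rough Python sketch. It assumes the request parameters (`q`, `site`, `client`, `output=xml_no_dtd`, `num`) and the result elements (`R`, `U`, `T`, `S`) as I remember them from the GSA search protocol, so check them against the documentation for your appliance version. The host name, the helper names and the sample XML are made up for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_search_url(host, query, collection="default_collection",
                     frontend="default_frontend", num=10):
    """Build a search URL that asks the appliance for raw XML results.

    Parameter names are assumptions based on the GSA search protocol;
    the host is whatever name your appliance answers on.
    """
    params = {
        "q": query,
        "site": collection,       # collection to search
        "client": frontend,       # front end configured on the appliance
        "output": "xml_no_dtd",   # XML instead of XSLT-rendered HTML
        "num": num,               # results per page
    }
    return f"https://{host}/search?{urlencode(params)}"


def parse_results(xml_text):
    """Extract URL, title and snippet from each <R> result element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        }
        for r in root.iter("R")
    ]


# Hand-written sample response (not captured from a real appliance).
SAMPLE_XML = """<GSP VER="3.2">
  <Q>lucene</Q>
  <RES SN="1" EN="1">
    <R N="1">
      <U>https://www.example.com/docs/lucene.html</U>
      <T>Lucene overview</T>
      <S>An introduction to indexing.</S>
    </R>
  </RES>
</GSP>"""

if __name__ == "__main__":
    print(build_search_url("search.example.com", "lucene vs gsa"))
    print(parse_results(SAMPLE_XML))
```

In a real integration you would fetch the URL with an HTTP client and hand the response body to `parse_results`; the sketch leaves the network call out so it runs without an appliance.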
Indexing
- Lucene provides crawlers (and a crawler API) to index content. It doesn't care whether your crawler actually crawls the website like Google does, crawls a database based on SQL statements, or provides a text stream read from flat files. Usually, though, you have to implement the crawler yourself if the provided ones do not fit your needs
- GSA uses the crawler technology used by Google and respects robots instructions (in robots.txt or meta tags). It provides a feed API for sources that cannot be crawled (i.e. no linking between them), and it supports setting up SQL queries against all major databases to retrieve data (be it URLs to crawl or the data itself)
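If you end up writing your own crawler to feed a Lucene index, respecting robots.txt is something you have to take care of yourself. A minimal sketch using Python's standard `urllib.robotparser` could look like this. The rules, the user agent name and the `allowed_urls` helper are made up for illustration, and it only covers robots.txt, not robots meta tags.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real crawler would download /robots.txt from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    candidates = [
        "https://www.example.com/index.html",
        "https://www.example.com/private/report.html",
    ]
    # Only URLs that pass the check would be handed on to the indexer.
    print(allowed_urls(ROBOTS_TXT, "my-crawler", candidates))
```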
Retrieval / relevance tuning
- Lucene does not aim at relevance tuning and has no good support for it (except boosting entries in the index). It's up to the application using the index results to do the tuning
- Lucene is the index used by Solr, which provides tuning and an architecture more similar to a GSA (including result retrieval over HTTP(S))
- GSA lets you bias result sets based on metadata, date and URL patterns. In the latest version you can even set up your own entities and bias the results based on them
- GSA supports facets for metadata out of the box, plus some fancier features in its interface such as preview images for documents, autosuggest etc.
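To make "it's up to the application to do the tuning" a bit more concrete: biasing by URL pattern and date can be approximated in your own code by re-ranking the hits that come back from the index. The following is only a toy sketch of that idea, not how Lucene, Solr or the GSA implement it. The `Hit` class, the `rerank` function and all boost factors are invented for the example.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    url: str
    score: float      # base relevance score as returned by the index
    modified: date    # last-modified date of the document


def rerank(hits, url_boosts, today, recency_days=365, recency_boost=1.2):
    """Re-order hits by multiplying the base score with bias factors.

    url_boosts is a list of (regex, factor) pairs; every matching pattern
    multiplies the score. Documents modified within recency_days get an
    extra recency_boost.
    """
    def adjusted(hit):
        score = hit.score
        for pattern, factor in url_boosts:
            if re.search(pattern, hit.url):
                score *= factor
        if (today - hit.modified).days <= recency_days:
            score *= recency_boost
        return score

    return sorted(hits, key=adjusted, reverse=True)


if __name__ == "__main__":
    hits = [
        Hit("https://example.com/archive/old", 2.0, date(2010, 1, 1)),
        Hit("https://example.com/docs/new", 1.5, date(2015, 12, 1)),
    ]
    boosts = [(r"/docs/", 1.5), (r"/archive/", 0.5)]
    for hit in rerank(hits, boosts, today=date(2016, 2, 1)):
        print(hit.url)
```

Multiplying factors keeps the original ranking intact when no bias applies, which is why the sketch uses it instead of adding fixed bonuses; other schemes would work as well.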
Commercial things
- Lucene is an open-source product (no license costs), but requires hardware to be purchased
- GSA starts at around $20k for 500k documents/URLs
- Google provides several support levels
- GSA licenses have to be renewed every 2 or 3 years (you get new hardware)
- GSA does not require any additional hardware (appliance is included)
...there's so much more to add, but I hope you get the point.
Update February 2016:
Google has informed partners that the GSA will be discontinued around 2019. The best site to link to at the moment seems to be http://fortune.com/2016/02/04/google-ends-search-appliance/.