SOLR autoCommit vs autoSoftCommit

hajime · Jul 15, 2013 · Viewed 25.6k times

I'm very confused about autoCommit and autoSoftCommit. Here is what I understand:

  • autoSoftCommit - after an autoSoftCommit, if the Solr server goes down, the autoSoftCommit documents will be lost.

  • autoCommit - does a hard commit to disk, making sure all the autoSoftCommit commits are written to disk, and commits any other pending documents.

The following configuration seems to be working only with autoSoftCommit. autoCommit on its own does not seem to be doing any commits. Is there something I am missing?

<updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoSoftCommit>
        <maxDocs>1000</maxDocs>
        <maxTime>1200000</maxTime>
    </autoSoftCommit>
    <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>120000</maxTime>
        <openSearcher>false</openSearcher>
    </autoCommit>
</updateHandler>

Why is autoCommit not working on its own?

Answer

vruizext · Oct 30, 2013

I think this article will be useful for you. It explains in detail how hard commits and soft commits work, and the tradeoffs that should be taken into account when tuning your system.

I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?
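As a concrete starting point, a minimal updateHandler along those lines might look like the sketch below. The 15-second hard commit is just the figure discussed above, not a universal default; pick whatever tlog-replay window you can tolerate.

<updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <!-- hard commit every 15 seconds to keep the tlog small; no new searcher is opened -->
    <autoCommit>
        <maxTime>15000</maxTime>
        <openSearcher>false</openSearcher>
    </autoCommit>
</updateHandler>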

We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval; see the bulk indexing bit below.

These are places to start.

HEAVY (BULK) INDEXING

The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.

Set your soft commit interval quite long, as in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn’t about near-real-time searching, so don’t do the extra work of opening any kind of searcher. Set your hard commit intervals to 15 seconds, openSearcher=false (a config sketch of these settings follows after the list below). Again, the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first. Only after you’ve tried the simple things should you consider refinements; they’re usually only required in unusual circumstances. But they include:

  • Turning off the tlog completely for the bulk-load operation.
  • Indexing offline with some kind of map-reduce process.
  • Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic; if a node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
  • etc.
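A rough solrconfig.xml sketch of those basic bulk-indexing settings (the values are simply the ones suggested above):

<!-- bulk indexing: visibility is not a priority; durability comes from frequent hard commits -->
<autoSoftCommit>
    <maxTime>600000</maxTime>      <!-- 10 minutes -->
</autoSoftCommit>
<autoCommit>
    <maxTime>15000</maxTime>       <!-- 15 seconds -->
    <openSearcher>false</openSearcher>
</autoCommit>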

INDEX-HEAVY, QUERY-LIGHT

By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.

Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer, maybe even hours, with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand. Set your hard commit to 15 seconds, openSearcher=false.
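A possible translation into config, assuming a few minutes of visibility latency is acceptable (the 5-minute value is only an illustration; raise it as far as your users allow):

<!-- index-heavy, query-light: documents become visible every few minutes -->
<autoSoftCommit>
    <maxTime>300000</maxTime>      <!-- 5 minutes, illustrative -->
</autoSoftCommit>
<autoCommit>
    <maxTime>15000</maxTime>       <!-- 15 seconds -->
    <openSearcher>false</openSearcher>
</autoCommit>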

INDEX-LIGHT, QUERY-LIGHT OR HEAVY

This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update.

Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
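If you go the autoCommit route rather than client-issued commits, a sketch could be as simple as dropping autoSoftCommit entirely and opening a searcher on each hard commit (10 minutes below is just the upper end of the range mentioned above):

<!-- index-light: no soft commits; each hard commit also makes changes visible -->
<autoCommit>
    <maxTime>600000</maxTime>      <!-- 10 minutes -->
    <openSearcher>true</openSearcher>
</autoCommit>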

INDEX-HEAVY, QUERY-HEAVY

This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start:

Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free. Set your hard commit interval to 15 seconds.
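A common starting shape for this, purely as a sketch (the 30-second soft commit is an assumption for illustration; push it as high as your users can stand):

<!-- NRT: soft commits for visibility, 15-second hard commits for durability -->
<autoSoftCommit>
    <maxTime>30000</maxTime>       <!-- 30 seconds, illustrative -->
</autoSoftCommit>
<autoCommit>
    <maxTime>15000</maxTime>       <!-- 15 seconds -->
    <openSearcher>false</openSearcher>
</autoCommit>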

In my case (index-heavy, query-heavy), master-slave replication was taking too long, slowing down the queries to the slaves. By increasing the softCommit interval to 15 minutes and increasing the hardCommit interval to 1 minute, the performance improvement was great. Now replication works with no problems, and the servers can handle many more requests per second.

This is my use case, though; I realized I don't really need the items to be available on the master in real time, since the master is only used for indexing items, and new items are available on the slaves after every replication cycle (5 minutes), which is totally fine for my case. You should tune these parameters for your own case.
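For what it's worth, the intervals described above would look roughly like this on the master (a sketch of the numbers mentioned, not the exact config used; openSearcher=false is an assumption here, since visibility on the master isn't needed in this setup):

<autoSoftCommit>
    <maxTime>900000</maxTime>      <!-- 15 minutes -->
</autoSoftCommit>
<autoCommit>
    <maxTime>60000</maxTime>       <!-- 1 minute -->
    <openSearcher>false</openSearcher>
</autoCommit>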