Lua vs Embedded Lisp and potential other candidates. for set based data processing

Hassan Syed picture Hassan Syed · Oct 27, 2011 · Viewed 7.8k times · Source

Current Choice: lua-jit. Impressive benchmarks, I am getting used to the syntax. Writing a high performance ABI will require careful consideration on how I will structure my C++.

Other Questions of interest

  1. Gambit-C and Guile as embeddable languages
  2. Lua Performance Tips (have the option of running with a disabled collector, and calling the collector at the end of a processing run(s) is always an option).

Background

I'm working on a realtime high volume (complex) event processing system. I have a DSL that represents the schema of the event structure at source, the storage format, certain domain specific constructs, firing internal events (to structure and drive general purpose processing), and encoding certain processing steps that always happen.

The DSL looks pretty similar to SQL, infact I am using berkeley db (via sqlite3 interface) for long-term storage of events. The important part here is that the processing of events is done set-based, like SQL. I have come to conclusion however that I should not add general-purpose processing logic to the DSL, and rather embed lua or lisp to take care of this.

The processing core is built arround boost::asio, it is multithreaded, rpc is done via protocol buffers, events are encoded using the protocol buffer IO library --i.e., the events are not structured using protocol buffer object they just use the same encoding/decoding library. I will create a dataset object that contains rows, pretty similar to how a database engine stores in memory sets. processing steps in the DSL will be taken care of first and then presented to the general purpose processing logic.

Regardless of what embeddable scripting environment I use, each thread in my processing core will probably needs it's own embedded-language-environment (that is how lua requires it to be at least if you are doing multi-threaded work).

The Question(s)

The choice at the moment is between lisp ECL and lua. Keeping in mind that performance and throughput is a strong requirement, this means minimising memory allocations is highly desired:

  1. If you were in my position which language would you chose ?

  2. are there any alternatives I should consider (don't suggest languages that don't have an embeddable implementation). Javascript v8 perhaps ?

  3. Does lisp fit the domain better ? I don't think lua and lisp are that different in terms of what they provide. Call me out :D

  4. Are there any other properties (like the ones below) I should be thinking about ?

  5. I assert that any form of embedded database IO (see the example DSL below for context) dwarfs the scripting language call on orders of magnitude, and that picking either will not add much overhead to the overall throughput. Am I on the right track ? :D

Desired Properties

  1. I would like to map my dataset onto a lisp list or lua table and I would like to minimise redundant data copies. For example adding a row from one dataset to another should try to use reference semantics if both tables have the same shape.

  2. I can guarantee that the dataset that is passed as input will not change whilst I have made the lua/lisp call. I want lua and lisp to enforce not altering the dataset as well if possible.

  3. After the embedded call end's the datasets should be destroyed, any references created would need to be replaced with copies (I guess).

DSL Example

I attach a DSL for your viewing pleasure so you can get an idea of what I am trying to achieve. Note: The DSL does not show general purpose processing.

// Derived Events : NewSession EndSession
NAMESPACE WebEvents
{
  SYMBOLTABLE DomainName(TEXT) AS INT4;
  SYMBOLTABLE STPageHitId(GUID) AS INT8;
  SYMBOLTABLE UrlPair(TEXT hostname ,TEXT scriptname) AS INT4;
  SYMBOLTABLE UserAgent(TEXT UserAgent) AS INT4;  

  EVENT 3:PageInput
  {
    //------------------------------------------------------------//
    REQUIRED 1:PagehitId              GUID
    REQUIRED 2:Attribute              TEXT;
    REQUIRED 3:Value                  TEXT; 

    FABRRICATED 4:PagehitIdSymbol     INT8;
    //------------------------------------------------------------//

    PagehitIdSymbol AS PROVIDED(INT8 ph_symbol)
                    OR Symbolise(PagehitId) USING STPagehitId;
  }

  // Derived Event : Pagehit
  EVENT 2:PageHit
  {
    //------------------------------------------------------------//
    REQUIRED 1:PageHitId              GUID;
    REQUIRED 2:SessionId              GUID;
    REQUIRED 3:DateHit                DATETIME;
    REQUIRED 4:Hostname               TEXT;
    REQUIRED 5:ScriptName             TEXT;
    REQUIRED 6:HttpRefererDomain      TEXT;
    REQUIRED 7:HttpRefererPath        TEXT;
    REQUIRED 8:HttpRefererQuery       TEXT;
    REQUIRED 9:RequestMethod          TEXT; // or int4
    REQUIRED 10:Https                 BOOL;
    REQUIRED 11:Ipv4Client            IPV4;
    OPTIONAL 12:PageInput             EVENT(PageInput)[];

    FABRRICATED 13:PagehitIdSymbol    INT8;
    //------------------------------------------------------------//
    PagehitIdSymbol AS  PROVIDED(INT8 ph_symbol) 
                    OR  Symbolise(PagehitId) USING STPagehitId;

    FIRE INTERNAL EVENT PageInput PROVIDE(PageHitIdSymbol);
  }

  EVENT 1:SessionGeneration
  {
    //------------------------------------------------------------//
        REQUIRED    1:BinarySessionId   GUID;
    REQUIRED    2:Domain            STRING;
    REQUIRED    3:MachineId         GUID;
    REQUIRED    4:DateCreated       DATETIME;
    REQUIRED    5:Ipv4Client        IPV4;
    REQUIRED    6:UserAgent         STRING;
    REQUIRED    7:Pagehit           EVENT(pagehit);

    FABRICATED  8:DomainId          INT4;
    FABRICATED  9:PagehitId         INT8;
    //-------------------------------------------------------------//

    DomainId  AS SYMBOLISE(domain)            USING DomainName;
    PagehitId AS SYMBOLISE(pagehit:PagehitId) USING STPagehitId;

    FIRE INTERNAL EVENT pagehit PROVIDE (PagehitId);
  } 
}

This project is a component of a Ph.D research project and is/will be free software. If your interested in working with me (or contributing) on this project, please leave a comment :D

Answer

snogglethorpe picture snogglethorpe · Oct 28, 2011

I strongly agree with @jpjacobs's points. Lua is an excellent choice for embedding, unless there's something very specific about lisp that you need (for instance, if your data maps particularly well to cons-cells).

I've used lisp for many many years, BTW, and I quite like lisp syntax, but these days I'd generally pick Lua. While I like the lisp language, I've yet to find a lisp implementation that captures the wonderful balance of features/smallness/usability for embedded use the way Lua does.

Lua:

  1. Is very small, both source and binary, an order of magnitude or more smaller than many more popular languages (Python etc). Because the Lua source code is so small and simple, it's perfectly reasonable to just include the entire Lua implementation in your source tree, if you want to avoid adding an external dependency.

  2. Is very fast. The Lua interpreter is much faster than most scripting languages (again, an order of magnitude is not uncommon), and LuaJIT2 is a very good JIT compiler for some popular CPU architectures (x86, arm, mips, ppc). Using LuaJIT can often speed things up by another order of magnitude, and in many cases, the result approaches the speed of C. LuaJIT is also a "drop-in" replacement for standard Lua 5.1: no application or user code changes are required to use it.

  3. Has LPEG. LPEG is a "Parsing Expression Grammar" library for Lua, which allows very easy, powerful, and fast parsing, suitable for both large and small tasks; it's a great replacement for yacc/lex/hairy-regexps. [I wrote a parser using LPEG and LuaJIT, which is much faster than the yacc/lex parser I was trying emulate, and was very easy and straight-forward to create.] LPEG is an add-on package for Lua, but is well-worth getting (it's one source file).

  4. Has a great C-interface, which makes it a pleasure to call Lua from C, or call C from Lua. For interfacing large/complex C++ libraries, one can use SWIG, or any one of a number of interface generators (one can also just use Lua's simple C interface with C++ of course).

  5. Has liberal licensing ("BSD-like"), which means Lua can be embedded in proprietary projects if you wish, and is GPL-compatible for FOSS projects.

  6. Is very, very elegant. It's not lisp, in that it's not based around cons-cells, but it shows clear influences from languages like scheme, with a straight-forward and attractive syntax. Like scheme (at least in it's earlier incarnations), it tends towards "minimal" but does a good job of balancing that with usability. For somebody with a lisp background (like me!), a lot about Lua will seem familiar, and "make sense", despite the differences.

  7. Is very flexible, and such features as metatables allow easily integrating domain-specific types and operations.

  8. Has a simple, attractive, and approachable syntax. This might not be such an advantage over lisp for existing lisp users, but might be relevant if you intend to have end-users write scripts.

  9. Is designed for embedding, and besides its small size and fast speed, has various features such as an incremental GC that make using a scripting language more viable in such contexts.

  10. Has a long history, and responsible and professional developers, who have shown good judgment in how they've evolved the language over the last 2 decades.

  11. Has a vibrant and friendly user-community.