I have decided to move my C# daemon application (using dotConnect as the ADO.NET provider) from SQL Server 2008 R2 to PostgreSQL 9.0.4 x64 (on Windows Server 2008 R2). I therefore slightly modified all queries to match PostgreSQL syntax and... got stuck on behavior that never happened with the same queries on SQL Server (not even on the lowly Express edition).
Let's say the database contains two very simple tables without any relation to each other. They look somewhat like this: ID, Name, Model, ScanDate, Notes. I have a transformation process which reads data over TCP/IP, processes it, starts a transaction and puts the results into the aforementioned two tables using vanilla INSERTs. The tables are initially empty; there are no BLOB columns. There are about 500,000 INSERTs on a bad day, all wrapped in a single transaction (which cannot be split into multiple transactions, btw). No SELECTs, UPDATEs or DELETEs are ever made. An example INSERT (ID is bigserial, autoincremented automatically):
INSERT INTO logs."Incoming" ("Name", "Model", "ScanDate", "Notes")
VALUES('Ford', 'Focus', '2011-06-01 14:12:32', NULL)
SQL Server calmly accepts the load while maintaining a reasonable Working Set of ~200 MB. PostgreSQL, however, takes up an additional 30 MB each second the transaction runs (!) and quickly exhausts system RAM.
I've done my RTFM and tried fiddling with postgresql.conf: setting "work_mem" to the minimum of 64 kB (this slightly slowed the RAM hogging) and reducing "shared_buffers" / "temp_buffers" to their minimums (no difference) - all to no avail. Reducing the transaction isolation level to Read Uncommitted didn't help either. There are no indexes except the one on ID BIGSERIAL (PK). PgSqlCommand.Prepare() makes no difference. No concurrent connections are ever established: the daemon uses the database exclusively.
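For reference, this is roughly what those postgresql.conf lines looked like; the minimum values are the ones from the 9.0 documentation, so treat the exact figures as approximate:

# postgresql.conf (excerpt) - the settings fiddled with, at their minimums
work_mem = 64kB          # minimum allowed; slightly slowed the RAM hogging
shared_buffers = 128kB   # reduced to the minimum; no difference
temp_buffers = 800kB     # reduced to the minimum; no difference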
It may seem PostgreSQL cannot cope with a mind-numbingly simple INSERT-fest, while SQL Server can. Maybe it's a difference between PostgreSQL's snapshot-based isolation and SQL Server's lock-based isolation? It's a fact for me: vanilla SQL Server works, while neither vanilla nor tweaked PostgreSQL does.
What can I do to make PostgreSQL memory consumption remain flat (as it apparently is with SQL Server) while the INSERT-based transaction runs?
DDL:
CREATE TABLE sometable
(
  "ID" bigserial NOT NULL,
  "Name" character varying(255) NOT NULL,
  "Model" character varying(255) NOT NULL,
  "ScanDate" date NOT NULL,
  CONSTRAINT "PK" PRIMARY KEY ("ID")
)
WITH (
  OIDS=FALSE
);
C# (requires Devart.Data.dll & Devart.Data.PostgreSql.dll)
using System;
using System.Data;
using Devart.Data.PostgreSql;

// NOTE: the command is deliberately created and prepared INSIDE the loop;
// this is what reproduces the memory growth.
PgSqlConnection conn = new PgSqlConnection("Host=localhost; Port=5432; Database=testdb; UserId=postgres; Password=###########");
conn.Open();
PgSqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted);
for (int ii = 0; ii < 300000; ii++)
{
    PgSqlCommand cmd = conn.CreateCommand();
    cmd.Transaction = tx;
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";

    PgSqlParameter parm = cmd.CreateParameter();
    parm.ParameterName = "@name";
    parm.Value = "SomeName";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@model";
    parm.Value = "SomeModel";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@scanDate";
    parm.PgSqlType = PgSqlType.Date;
    parm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
    cmd.Parameters.Add(parm);

    cmd.Prepare(); // a NEW server-side plan is created on every iteration
    long newID = (long)cmd.ExecuteScalar();
}
tx.Commit();
This recreates the memory hogging. HOWEVER: if the 'cmd' variable is created and .Prepare()d outside the FOR loop, the memory does not increase! Apparently, preparing multiple PgSqlCommands with IDENTICAL SQL but different parameter values does not result in a single query plan inside PostgreSQL, the way it does in SQL Server.
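For illustration, here is the reworked loop that keeps memory flat: the command is created, parameterized and prepared once, and only the parameter values change per iteration (a sketch against the same table as above; same using directives as the snippet above, error handling omitted):

PgSqlConnection conn = new PgSqlConnection("Host=localhost; Port=5432; Database=testdb; UserId=postgres; Password=###########");
conn.Open();
PgSqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted);

// Create, parameterize and prepare the command ONCE, outside the loop.
PgSqlCommand cmd = conn.CreateCommand();
cmd.Transaction = tx;
cmd.CommandType = CommandType.Text;
cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";

PgSqlParameter pName = cmd.CreateParameter();
pName.ParameterName = "@name";
pName.PgSqlType = PgSqlType.VarChar;
cmd.Parameters.Add(pName);

PgSqlParameter pModel = cmd.CreateParameter();
pModel.ParameterName = "@model";
pModel.PgSqlType = PgSqlType.VarChar;
cmd.Parameters.Add(pModel);

PgSqlParameter pScanDate = cmd.CreateParameter();
pScanDate.ParameterName = "@scanDate";
pScanDate.PgSqlType = PgSqlType.Date;
cmd.Parameters.Add(pScanDate);

cmd.Prepare(); // one server-side plan for all 300,000 rows

for (int ii = 0; ii < 300000; ii++)
{
    // only the values change; the prepared plan is reused
    pName.Value = "SomeName";
    pModel.Value = "SomeModel";
    pScanDate.Value = new DateTime(2011, 6, 1, 14, 12, 13);
    long newID = (long)cmd.ExecuteScalar();
}
tx.Commit();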
The problem remains: if one uses Fowler's Active Record design pattern to insert multiple new objects, sharing a prepared PgSqlCommand instance between them is not elegant.
Is there a way/option to facilitate query-plan reuse across multiple queries that have identical structure yet different argument values?
I've decided to look at the simplest possible case - a SQL batch run directly on the DBMS, without ADO.NET (as suggested by Jordani). Surprisingly, PostgreSQL does not compare incoming SQL queries and does not reuse internal compiled plans - even when the incoming queries are literally identical, argument values included! For instance, the following batch:
BEGIN TRANSACTION;
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100,000 times
COMMIT;
and its SQL Server counterpart, which runs without any memory growth:

BEGIN TRANSACTION;
INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100,000 times
COMMIT;
and the PostgreSQL log file (thanks, Sayap!) contains:
2011-06-05 16:06:29 EEST LOG: duration: 0.000 ms statement: set client_encoding to 'UNICODE'
2011-06-05 16:06:43 EEST LOG: duration: 15039.000 ms statement: BEGIN TRANSACTION;
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- 99998 lines of the same as above
COMMIT;
Apparently, even when the whole batch is transmitted to the server as-is, the server parses and plans each of the identical statements separately instead of recognizing the repetition.
As Jordani suggested, I've tried the Npgsql driver instead of dotConnect - with the same (lack of) results. However, the Npgsql source for the .Prepare() method contains these enlightening lines:
// a fresh plan name is generated on every Prepare() call -
// hence npgsqlplan1, npgsqlplan2, ... in the log below
planName = m_Connector.NextPlanName();
String portalName = m_Connector.NextPortalName();
parse = new NpgsqlParse(planName, GetParseCommandText(), new Int32[] { });
m_Connector.Parse(parse);
The new content in the log file:
2011-06-05 15:25:26 EEST LOG: duration: 0.000 ms statement: BEGIN; SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
2011-06-05 15:25:26 EEST LOG: duration: 1.000 ms parse npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG: duration: 0.000 ms bind npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL: parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG: duration: 1.000 ms execute npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL: parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG: duration: 0.000 ms parse npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG: duration: 0.000 ms bind npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL: parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG: duration: 0.000 ms execute npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL: parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG: duration: 0.000 ms parse npgsqlplan3: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
The inefficiency is quite obvious in this log excerpt: every .Prepare() call parses a brand-new named plan (npgsqlplan1, npgsqlplan2, npgsqlplan3, ...) instead of binding and executing a single one.
Frank's note about WAL is another awakening: something else to configure that SQL Server hides away from a typical MS developer.
NHibernate (even in its simplest usage) reuses prepared SqlCommands properly... if only it had been used from the start...
It is obvious that an architectural difference exists between SQL Server and PostgreSQL, and code built specifically for SQL Server (and thus blissfully unaware of the 'unable-to-reuse-identical-sql' possibility) will not work efficiently on PostgreSQL without major refactoring. And refactoring 130+ legacy ActiveRecord classes to reuse prepared SqlCommand objects in a messy multithreaded middleware is not a 'just-replace-dbo-with-public'-type affair.
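For what it's worth, one retrofit that would avoid touching every ActiveRecord class is a cache of prepared commands keyed by SQL text, one cache per connection (commands are not thread-safe). This is only a sketch with hypothetical names, not the actual middleware code:

using System;
using System.Collections.Generic;
using Devart.Data.PostgreSql;

// Hypothetical helper: one prepared PgSqlCommand per distinct SQL text.
// Must be used from a single thread / connection at a time.
class PreparedCommandCache
{
    private readonly Dictionary<string, PgSqlCommand> _cache =
        new Dictionary<string, PgSqlCommand>();

    // Returns the cached command for this SQL text; on first use, stores
    // and prepares the command built by 'factory' (which must also add
    // the parameters).
    public PgSqlCommand GetOrPrepare(string sql, Func<PgSqlCommand> factory)
    {
        PgSqlCommand cmd;
        if (!_cache.TryGetValue(sql, out cmd))
        {
            cmd = factory();
            cmd.Prepare(); // one server-side plan per distinct SQL text
            _cache[sql] = cmd;
        }
        return cmd;
    }
}

Each ActiveRecord Insert() would then fetch its command from the cache and merely assign parameter values before ExecuteScalar() - still a refactoring, but a mechanical one.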
Unfortunately for my overtime, Eevar's answer is correct :)
Thanks to everyone who pitched in!
Reducing work_mem and shared_buffers is not a good idea; databases (including PostgreSQL) love RAM.
But this might not be your biggest problem: what about the WAL settings? wal_buffers should be large enough to hold the entire transaction, all 500k INSERTs. What is the current setting? And what about checkpoint_segments?
500k INSERTs should not be a problem; PostgreSQL can handle this without memory problems.
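For reference, these knobs live in postgresql.conf; the values below are purely illustrative placeholders, not tuned recommendations (see the documentation link that follows):

# postgresql.conf (excerpt) - WAL settings mentioned above; values illustrative
wal_buffers = 16MB          # buffer for WAL data not yet flushed to disk
checkpoint_segments = 32    # 16 MB WAL segments between automatic checkpoints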
http://www.postgresql.org/docs/current/interactive/runtime-config-wal.html