I find 7-Zip great and I would like to use it from .NET applications. I have a 10 MB file (a.001) and 7-Zip takes about 2 seconds to encode it.
Now it would be nice if I could do the same thing in C#. I downloaded the LZMA SDK C# source code from http://www.7-zip.org/sdk.html and basically copied the CS directory into a console application in Visual Studio.
Everything compiled smoothly, so in the output directory I placed the file a.001, which is 10 MB in size. In the Main method that came with the source code I put:
[STAThread]
static int Main(string[] args)
{
    // e stands for encode
    args = "e a.001 output.7z".Split(' '); // added this line for debugging
    try
    {
        return Main2(args);
    }
    catch (Exception e)
    {
        Console.WriteLine("{0} Caught exception #1.", e);
        // throw e;
        return 1;
    }
}
When I execute the console application it works and I get output.7z in the working directory. The problem is that it takes far too long: about 15 seconds to execute! I have also tried the approach from https://stackoverflow.com/a/8775927/637142 and it is also very slow. Why is it roughly 10 times slower than the actual 7-Zip program?
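For reference, that approach uses the SDK's SevenZip.Compression.LZMA.Encoder directly; here is a minimal sketch of how I call it (the property values are just guesses, and the output is a raw .lzma stream rather than a .7z archive):

using System.IO;
using SevenZip;                    // CoderPropID
using SevenZip.Compression.LZMA;   // Encoder

static class LzmaSdkSample
{
    public static void Compress(string inFile, string outFile)
    {
        var encoder = new Encoder();

        // Untuned example settings: 8 MB dictionary, 128 fast bytes.
        CoderPropID[] ids = { CoderPropID.DictionarySize, CoderPropID.NumFastBytes };
        object[] values = { 1 << 23, 128 };
        encoder.SetCoderProperties(ids, values);

        using (var input = File.OpenRead(inFile))
        using (var output = File.Create(outFile))
        {
            encoder.WriteCoderProperties(output);       // 5-byte property header
            long size = input.Length;
            for (int i = 0; i < 8; i++)                 // 8-byte uncompressed size
                output.WriteByte((byte)(size >> (8 * i)));
            encoder.Code(input, output, -1, -1, null);  // the actual compression
        }
    }
}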
Even if I set 7-Zip itself to use only one thread, it still takes much less time than my program (3 seconds vs. 15).
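(The equivalent on the 7-Zip command line would be the -mmt=1 switch; for example, launching it from C#, with the install path being an assumption:)

using System.Diagnostics;

// Run the 7-Zip command line with one thread and wait for it to finish.
var psi = new ProcessStartInfo
{
    FileName = @"C:\Program Files\7-Zip\7z.exe",  // adjust to your 7-Zip install path
    Arguments = "a -mmt=1 output.7z a.001",       // -mmt=1 => use a single thread
    UseShellExecute = false
};
using (var p = Process.Start(psi))
{
    p.WaitForExit();
}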
Could it be because C# is slower than assembly or C? I notice that the algorithm does a lot of heavy operations. For example, compare these two blocks of code; they both do the same thing:
#include <time.h>
#include <stdio.h>

int main(void)
{
    time_t now;
    int i, j, k, x;
    long counter;

    counter = 0;
    now = time(NULL);

    /* LOOP */
    for (x = 0; x < 10; x++)
    {
        counter = -1234567890 + x + 2;
        for (j = 0; j < 10000; j++)
            for (i = 0; i < 1000; i++)
                for (k = 0; k < 1000; k++)
                {
                    if (counter > 10000)
                        counter = counter - 9999;
                    else
                        counter = counter + 1;
                }
        printf(" %d \n", (int)(time(NULL) - now)); /* display elapsed time */
    }

    printf("counter = %ld\n\n", counter); /* display result of counter */
    printf("Elapsed time = %d seconds ", (int)(time(NULL) - now));
    getchar(); /* wait before closing */
    return 0;
}
Output:
static void Main(string[] args)
{
    DateTime now;
    int i, j, k, x;
    long counter;

    counter = 0;
    now = DateTime.Now;

    /* LOOP */
    for (x = 0; x < 10; x++)
    {
        counter = -1234567890 + x + 2;
        for (j = 0; j < 10000; j++)
            for (i = 0; i < 1000; i++)
                for (k = 0; k < 1000; k++)
                {
                    if (counter > 10000)
                        counter = counter - 9999;
                    else
                        counter = counter + 1;
                }
        Console.WriteLine((int)(DateTime.Now - now).TotalSeconds); // display elapsed time
    }

    Console.Write("counter = {0} \n", counter); // display result of counter
    Console.Write("Elapsed time = {0} seconds", (int)(DateTime.Now - now).TotalSeconds);
    Console.Read();
}
Output:
Note how much slower C# was. Both programs were run outside Visual Studio in Release mode. Maybe that is the reason why it takes so much longer in .NET than in C++.
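(As a side note, System.Diagnostics.Stopwatch would give more precise timings than DateTime.Now; a minimal sketch of how the timing could be done instead:)

using System;
using System.Diagnostics;

// Time a piece of work with Stopwatch instead of DateTime subtraction.
var sw = Stopwatch.StartNew();

// ... benchmark loop goes here ...

sw.Stop();
Console.WriteLine("Elapsed time = {0:F1} seconds", sw.Elapsed.TotalSeconds);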
I cannot figure out what is causing the problem, so I guess I will use 7z.dll and invoke the necessary methods from C#. A library that does that is SevenZipSharp (http://sevenzipsharp.codeplex.com/); that way I am using the same library that 7-Zip itself uses:
// don't forget to add a reference to SevenZipSharp (from the link above)
static void Main(string[] args)
{
    // load the native 7z.dll
    SevenZip.SevenZipCompressor.SetLibraryPath(@"C:\Program Files (x86)\7-Zip\7z.dll");
    SevenZip.SevenZipCompressor compress = new SevenZip.SevenZipCompressor();
    compress.CompressDirectory("MyFolderToArchive", "output.7z");
}
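To compress just the single test file instead of a whole folder, I assume something like this would also work (CompressFiles and the compression properties are part of SevenZipSharp's API, but the exact settings here are only guesses):

static void Main(string[] args)
{
    SevenZip.SevenZipCompressor.SetLibraryPath(@"C:\Program Files (x86)\7-Zip\7z.dll");

    var compressor = new SevenZip.SevenZipCompressor
    {
        CompressionMethod = SevenZip.CompressionMethod.Lzma,   // same codec as the SDK sample
        CompressionLevel = SevenZip.CompressionLevel.Normal
    };

    // Compress the single 10 MB test file into a .7z archive.
    compressor.CompressFiles("output.7z", "a.001");
}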
I ran a profiler on the code, and the most expensive operation appears to be searching for matches. In C#, it searches a single byte at a time. There are two functions (GetMatches and Skip) in LzBinTree.cs that contain the following code snippet, and something like 40-60% of the total time is spent in this code:
if (_bufferBase[pby1 + len] == _bufferBase[cur + len])
{
    while (++len != lenLimit)
        if (_bufferBase[pby1 + len] != _bufferBase[cur + len])
            break;
It's basically trying to find the match length a single byte at a time. I extracted that into its own method:
if (GetMatchLength(lenLimit, cur, pby1, ref len))
{
And if you use unsafe code and cast the byte* to a ulong* so you can compare 8 bytes at a time instead of 1, the speed almost doubled for my test data (in a 64-bit process):
private bool GetMatchLength(UInt32 lenLimit, UInt32 cur, UInt32 pby1, ref UInt32 len)
{
    if (_bufferBase[pby1 + len] != _bufferBase[cur + len])
        return false;
    len++;

    // This method works with or without the following line, but with it,
    // it runs much much faster:
    GetMatchLengthUnsafe(lenLimit, cur, pby1, ref len);

    while (len != lenLimit
        && _bufferBase[pby1 + len] == _bufferBase[cur + len])
    {
        len++;
    }
    return true;
}
private unsafe void GetMatchLengthUnsafe(UInt32 lenLimit, UInt32 cur, UInt32 pby1, ref UInt32 len)
{
    const int size = sizeof(ulong);
    if (lenLimit < size)
        return;
    lenLimit -= size - 1;

    fixed (byte* p1 = &_bufferBase[cur])
    fixed (byte* p2 = &_bufferBase[pby1])
    {
        while (len < lenLimit)
        {
            if (*((ulong*)(p1 + len)) == *((ulong*)(p2 + len)))
            {
                len += size;
            }
            else
                return;
        }
    }
}
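If you would rather avoid unsafe code, the same 8-bytes-at-a-time idea can be expressed with BitConverter.ToUInt64; this is my own variation, not part of the SDK, and the array bounds checks make it somewhat slower than the pointer version:

private void GetMatchLengthWide(UInt32 lenLimit, UInt32 cur, UInt32 pby1, ref UInt32 len)
{
    const int size = sizeof(ulong);
    if (lenLimit < size)
        return;
    lenLimit -= size - 1;

    // Compare 8 bytes per iteration; the caller finishes the tail byte by byte.
    while (len < lenLimit &&
           BitConverter.ToUInt64(_bufferBase, (int)(cur + len)) ==
           BitConverter.ToUInt64(_bufferBase, (int)(pby1 + len)))
    {
        len += size;
    }
}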