Getting transient errors when making calls against Azure SQL Database from Azure Function

VSorokoumov · Jan 14, 2019 · Viewed 7.2k times

We are using .NET Core 2.1 and Entity Framework Core 2.1.1.

I have the following setup in Azure West Europe:

  • Azure SQL Database -- Premium P2 250 DTU -- Public endpoint, no VNET peering -- "Allow access to Azure Services" = ON

  • Azure Functions -- Consumption Plan -- Timeout 10 Minutes

  • Azure Blob storage -- hot tier

Multiple blobs are uploaded to Azure Blob storage, and Azure Functions (up to 5 concurrently) are triggered via Azure Event Grid. The functions check the structure of the blobs against metadata stored in Azure SQL DB. Each blob contains up to 500K records and 5 columns of payload data. For each record, the function makes a call against Azure SQL DB; there is no caching.

When multiple blobs are processed in parallel (up to 5 asynchronous Azure Function calls at the same time) and the blobs are large (200K-500K records), I often get the following transient and connection errors from Entity Framework Core:

1. An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call.

2. A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: SSL Provider, error: 0 - The wait operation timed out.)

3. Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. This failure occurred while attempting to connect to the routing destination. The duration spent while attempting to connect to the original server was - [Pre-Login] initialization=13633; handshake=535; [Login] initialization=1; authentication=0; [Post-Login] complete=156; The duration spent while attempting to connect to this server was - [Pre-Login] initialization=5679; handshake=2044;

4. A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: SSL Provider, error: 0 - The wait operation timed out.)

5. Server provided routing information, but timeout already expired.

At the same time, no health events were reported for the Azure SQL Database during the test, and the metrics look healthy: MAX Workers < 3.5%, Sum Successful Connections < 35, MAX Sessions Percentage < 0.045%, Max Log IO percentage < 0.024%, Sum Failed Connections = 0, MAX DTU < 10%, Max Data IO < 0.055%, MAX CPU < 10%.

Running connection stats on Azure SQL DB (sys.database_connection_stats_ex): No failed, aborted or throttled connections.

select *
from sys.database_connection_stats_ex
where start_time >= CAST(FLOOR(CAST(getdate() AS float)) AS DATETIME)
order by start_time desc

Has anyone faced similar issues with the combination of .NET Core, Entity Framework Core, and Azure SQL Database? Why am I getting these transient errors, and why do the Azure SQL Database metrics look so good without reflecting any of these issues?

Thanks a lot in advance for any help.

using Microsoft.EntityFrameworkCore;

namespace MyProject.Domain.Data
{
    public sealed class ApplicationDbContextFactory : IApplicationDbContextFactory
    {
        private readonly IConfigurationDbConfiguration _configuration;
        private readonly IDateTimeService _dateTimeService;

        public ApplicationDbContextFactory(IConfigurationDbConfiguration configuration, IDateTimeService dateTimeService)
        {
            _configuration = configuration;
            _dateTimeService = dateTimeService;
        }

        public ApplicationDbContext Create()
        {
            //Not initialized in ctor due to unit testing static functions.
            var options = new DbContextOptionsBuilder<ApplicationDbContext>()
                .UseSqlServer(_configuration.ConfigurationDbConnectionString).Options;

            return new ApplicationDbContext(options, _dateTimeService);
        }
    }
}

Answer

Thomas · Jan 26, 2019

I found this good documentation on SQL Database transient errors:

From the documentation:

A transient error has an underlying cause that soon resolves itself. An occasional cause of transient errors is when the Azure system quickly shifts hardware resources to better load-balance various workloads. Most of these reconfiguration events finish in less than 60 seconds. During this reconfiguration time span, you might have connectivity issues to SQL Database. Applications that connect to SQL Database should be built to expect these transient errors. To handle them, implement retry logic in their code instead of surfacing them to users as application errors.

It then explains in detail how to build retry logic for transient errors.
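As a rough illustration of what hand-written retry logic involves, here is a minimal sketch with exponential backoff. It is not the code from the documentation: the TransientRetry class, the isTransient predicate, and the default retry count and delay are all hypothetical names and values picked for the example.

```csharp
using System;
using System.Threading;

public static class TransientRetry
{
    // Runs 'operation' and retries it while 'isTransient' says the thrown
    // exception is worth retrying. All names and defaults are illustrative.
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        int maxRetries = 5,
        int baseDelayMs = 200)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (attempt < maxRetries && isTransient(ex))
            {
                // Exponential backoff: baseDelayMs, then 2x, 4x, ...
                int delayMs = baseDelayMs * (1 << attempt);
                Thread.Sleep(delayMs);
            }
        }
    }
}
```

Once maxRetries is exhausted, or if the predicate rejects the exception, the exception filter no longer matches and the original exception propagates to the caller. The predicate is where you would decide which errors count as transient, for example by inspecting SqlException.Number.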

The Entity Framework Core SQL Server provider has built-in retry logic, which you enable with EnableRetryOnFailure():

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder
        .UseSqlServer("<connection string>", options => options.EnableRetryOnFailure());
}
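Applied to the Create() method of the factory in the question, this could look like the sketch below. The retry count and maximum delay are assumed values chosen for illustration, not recommendations from the documentation; tune them for your workload.

```csharp
// Sketch only: Create() from the question's ApplicationDbContextFactory with retries enabled.
// Requires: using System; using Microsoft.EntityFrameworkCore;
public ApplicationDbContext Create()
{
    var options = new DbContextOptionsBuilder<ApplicationDbContext>()
        .UseSqlServer(
            _configuration.ConfigurationDbConnectionString,
            sqlOptions => sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 5,                        // assumed value
                maxRetryDelay: TimeSpan.FromSeconds(30), // assumed value
                errorNumbersToAdd: null))                // no extra SQL error numbers
        .Options;

    return new ApplicationDbContext(options, _dateTimeService);
}
```

Note that once a retrying execution strategy is enabled, any transaction you start yourself with BeginTransaction() has to be run through the strategy returned by context.Database.CreateExecutionStrategy(), otherwise EF Core throws an InvalidOperationException.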

You can find more information here: