How to efficiently write large files to disk on background thread (Swift)

Question 1

How to efficiently write large files to disk on background thread (Swift)

ios swift multithreading large-files large-data

Tommie C. · Aug 12, 2015 · Viewed 18.6k times · Source

Answer

Answer

Performance depends wether or not the data fits in RAM. If it does, then you should use NSData writeToURL with the atomically feature turned on, which is what you're doing.

Apple's notes about this being dangerous when "writing to a public directory" are completely irrelevant on iOS because there are no public directories. That section only applies to OS X. And frankly it's not really important there either.

So, the code you've written is as efficient as possible as long as the video fits in RAM (about 100MB would be a safe limit).

For files that don't fit in RAM, you need to use a stream or your app will crash while holding the video in memory. To download a large video from a server and write it to disk, you should use NSURLSessionDownloadTask.

In general, streaming (including NSURLSessionDownloadTask) will be orders of magnitude slower than NSData.writeToURL(). So don't use a stream unless you need to. All operations on NSData are extremely fast, it is perfectly capable of dealing with files that are multiple terabytes in size with excellent performance on OS X (iOS obviously can't have files that large, but it's the same class with the same performance).

There are a few issues in your code.

This is wrong:

let filePath = NSTemporaryDirectory() + named

Instead always do:

let filePath = NSTemporaryDirectory().stringByAppendingPathComponent(named)

But that's not ideal either, you should avoid using paths (they are buggy and slow). Instead use a URL like this:

let tmpDir = NSURL(fileURLWithPath: NSTemporaryDirectory())!
let fileURL = tmpDir.URLByAppendingPathComponent(named)

Also, you're using a path to check if the file exists... don't do this:

if NSFileManager.defaultManager().fileExistsAtPath( filePath ) {

Instead use NSURL to check if it exists:

if fileURL.checkResourceIsReachableAndReturnError(nil) {

Question 2

Update

I have resolved and removed the distracting error. Please read the entire post and feel free to leave comments if any questions remain.

Background

I am attempting to write relatively large files (video) to disk on iOS using Swift 2.0, GCD, and a completion handler. I would like to know if there is a more efficient way to perform this task. The task needs to be done without blocking the Main UI, while using completion logic, and also ensuring that the operation happens as quickly as possible. I have custom objects with an NSData property so I am currently experimenting using an extension on NSData. As an example an alternate solution might include using NSFilehandle or NSStreams coupled with some form of thread safe behavior that results in much faster throughput than the NSData writeToURL function on which I base the current solution.

What's wrong with NSData Anyway?

Please note the following discussion taken from the NSData Class Reference, (Saving Data). I do perform writes to my temp directory however the main reason that I am having an issue is that I can see a noticeable lag in the UI when dealing with large files. This lag is precisely because NSData is not asynchronous (and Apple Docs note that atomic writes can cause performance issues on "large" files ~ > 1mb). So when dealing with large files one is at the mercy of whatever internal mechanism is at work within the NSData methods.

I did some more digging and found this info from Apple..."This method is ideal for converting data:// URLs to NSData objects, and can also be used for reading short files synchronously. If you need to read potentially large files, use inputStreamWithURL: to open a stream, then read the file a piece at a time." (NSData Class Reference, Objective-C, +dataWithContentsOfURL). This info seems to imply that I could try using streams to write the file out on a background thread if moving the writeToURL to the background thread (as suggested by @jtbandes) is not sufficient.

The NSData class and its subclasses provide methods to quickly and easily save their contents to disk. To minimize the risk of data loss, these methods provide the option of saving the data atomically. Atomic writes guarantee that the data is either saved in its entirety, or it fails completely. The atomic write begins by writing the data to a temporary file. If this write succeeds, then the method moves the temporary file to its final location.

While atomic write operations minimize the risk of data loss due to corrupt or partially-written files, they may not be appropriate when writing to a temporary directory, the user’s home directory or other publicly accessible directories. Any time you work with a publicly accessible file, you should treat that file as an untrusted and potentially dangerous resource. An attacker may compromise or corrupt these files. The attacker can also replace the files with hard or symbolic links, causing your write operations to overwrite or corrupt other system resources.

Avoid using the writeToURL:atomically: method (and the related methods) when working inside a publicly accessible directory. Instead initialize an NSFileHandle object with an existing file descriptor and use the NSFileHandle methods to securely write the file.

Other Alternatives

One article on Concurrent Programming at objc.io provides interesting options on "Advanced: File I/O in the Background". Some of the options involve use of an InputStream as well. Apple also has some older references to reading and writing files asynchronously. I am posting this question in anticipation of Swift alternatives.

Example of an appropriate answer

Here is an example of an appropriate answer that might satisfy this type of question. (Taken for the Stream Programming Guide, Writing To Output Streams)

Using an NSOutputStream instance to write to an output stream requires several steps:

Create and initialize an instance of NSOutputStream with a repository for the written data. Also set a delegate.
Schedule the stream object on a run loop and open the stream.
Handle the events that the stream object reports to its delegate.
If the stream object has written data to memory, obtain the data by requesting the NSStreamDataWrittenToMemoryStreamKey property.
When there is no more data to write, dispose of the stream object.

I am looking for the most proficient algorithm that applies to writing extremely large files to iOS using Swift, APIs, or possibly even C/ObjC would suffice. I can transpose the algorithm into appropriate Swift compatible constructs.

Nota Bene

~~I understand the informational error below. It is included for completeness.~~ This question is asking whether or not there is a better algorithm to use for writing large files to disk with a guaranteed dependency sequence (e.g. NSOperation dependencies). If there is please provide enough information (description/sample for me to reconstruct pertinent Swift 2.0 compatible code). Please advise if I am missing any information that would help answer the question.

Note on the extension

I've added a completion handler to the base writeToURL to ensure that no unintended resource sharing occurs. My dependent tasks that use the file should never face a race condition.

extension NSData {

    func writeToURL(named:String, completion: (result: Bool, url:NSURL?) -> Void)  {

       let filePath = NSTemporaryDirectory() + named
       //var success:Bool = false
       let tmpURL = NSURL( fileURLWithPath:  filePath )
       weak var weakSelf = self


      dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), {
                //write to URL atomically
                if weakSelf!.writeToURL(tmpURL, atomically: true) {

                        if NSFileManager.defaultManager().fileExistsAtPath( filePath ) {
                            completion(result: true, url:tmpURL)                        
                        } else {
                            completion (result: false, url:tmpURL)
                        }
                    }
            })

        }
    }

This method is used to process the custom objects data from a controller using:

var items = [AnyObject]()
if let video = myCustomClass.data {

    //video is of type NSData        
    video.writeToURL("shared.mp4", completion: { (result, url) -> Void in
        if result {
            items.append(url!)
            if items.count > 0 {

                let sharedActivityView = UIActivityViewController(activityItems: items, applicationActivities: nil)

                self.presentViewController(sharedActivityView, animated: true) { () -> Void in
                //finished
    }
}
        }
     })
}

Conclusion

The Apple Docs on Core Data Performance provide some good advice on dealing with memory pressure and managing BLOBs. This is really one heck of an article with a lot of clues to behavior and how to moderate the issue of large files within your app. Now although it is specific to Core Data and not files, the warning on atomic writing does tell me that I ought to implement methods that write atomically with great care.

With large files, the only safe way to manage writing seems to be adding in a completion handler (to the write method) and showing an activity view on the main thread. Whether one does that with a stream or by modifying an existing API to add completion logic is up to the reader. I've done both in the past and am in the midst of testing for best performance.

Until then, I'm changing the solution to remove all binary data properties from Core Data and replacing them with strings to hold asset URLs on disk. I am also leveraging the built in functionality from Assets Library and PHAsset to grab and store all related asset URLs. When or if I need to copy any assets I will use standard API methods (export methods on PHAsset/Asset Library) with completion handlers to notify user of finished state on the main thread.

(Really useful snippets from the Core Data Performance article)

Reducing Memory Overhead

It is sometimes the case that you want to use managed objects on a temporary basis, for example to calculate an average value for a particular attribute. This causes your object graph, and memory consumption, to grow. You can reduce the memory overhead by re-faulting individual managed objects that you no longer need, or you can reset a managed object context to clear an entire object graph. You can also use patterns that apply to Cocoa programming in general.

You can re-fault an individual managed object using NSManagedObjectContext’s refreshObject:mergeChanges: method. This has the effect of clearing its in-memory property values thereby reducing its memory overhead. (Note that this is not the same as setting the property values to nil—the values will be retrieved on demand if the fault is fired—see Faulting and Uniquing.)

When you create a fetch request you can set includesPropertyValues to NO > to reduce memory overhead by avoiding creation of objects to represent the property values. You should typically only do so, however, if you are sure that either you will not need the actual property data or you already have the information in the row cache, otherwise you will incur multiple trips to the persistent store.

You can use the reset method of NSManagedObjectContext to remove all managed objects associated with a context and "start over" as if you'd just created it. Note that any managed object associated with that context will be invalidated, and so you will need to discard any references to and re-fetch any objects associated with that context in which you are still interested. If you iterate over a lot of objects, you may need to use local autorelease pool blocks to ensure temporary objects are deallocated as soon as possible.

If you do not intend to use Core Data’s undo functionality, you can reduce your application's resource requirements by setting the context’s undo manager to nil. This may be especially beneficial for background worker threads, as well as for large import or batch operations.

Finally, Core Data does not by default keep strong references to managed objects (unless they have unsaved changes). If you have lots of objects in memory, you should determine the owning references. Managed objects maintain strong references to each other through relationships, which can easily create strong reference cycles. You can break cycles by re-faulting objects (again by using the refreshObject:mergeChanges: method of NSManagedObjectContext).

Large Data Objects (BLOBs)

If your application uses large BLOBs ("Binary Large OBjects" such as image and sound data), you need to take care to minimize overheads. The exact definition of “small”, “modest”, and “large” is fluid and depends on an application’s usage. A loose rule of thumb is that objects in the order of kilobytes in size are of a “modest” sized and those in the order of megabytes in size are “large” sized. Some developers have achieved good performance with 10MB BLOBs in a database. On the other hand, if an application has millions of rows in a table, even 128 bytes might be a "modest" sized CLOB (Character Large OBject) that needs to be normalized into a separate table.

In general, if you need to store BLOBs in a persistent store, you should use an SQLite store. The XML and binary stores require that the whole object graph reside in memory, and store writes are atomic (see Persistent Store Features) which means that they do not efficiently deal with large data objects. SQLite can scale to handle extremely large databases. Properly used, SQLite provides good performance for databases up to 100GB, and a single row can hold up to 1GB (although of course reading 1GB of data into memory is an expensive operation no matter how efficient the repository).

A BLOB often represents an attribute of an entity—for example, a photograph might be an attribute of an Employee entity. For small to modest sized BLOBs (and CLOBs), you should create a separate entity for the data and create a to-one relationship in place of the attribute. For example, you might create Employee and Photograph entities with a one-to-one relationship between them, where the relationship from Employee to Photograph replaces the Employee's photograph attribute. This pattern maximizes the benefits of object faulting (see Faulting and Uniquing). Any given photograph is only retrieved if it is actually needed (if the relationship is traversed).

It is better, however, if you are able to store BLOBs as resources on the filesystem, and to maintain links (such as URLs or paths) to those resources. You can then load a BLOB as and when necessary.

Note:

I've moved the logic below into the completion handler (see the code above) and I no longer see any error. As mentioned before this question is about whether or not there is a more performant way to process large files in iOS using Swift.

~~When attempting to process the resulting items array to pass to a UIActvityViewController, using the following logic:~~

if items.count > 0 {
let sharedActivityView = UIActivityViewController(activityItems: items, applicationActivities: nil) self.presentViewController(sharedActivityView, animated: true) { () -> Void in //finished} }

I am seeing the following error: Communications error: { count = 1, contents = "XPCErrorDescription" => { length = 22, contents = "Connection interrupted" } }> (please note, I am looking for a better design, not an answer to this error message)

How to efficiently write large files to disk on background thread (Swift)

Answer

Related questions