I have a number of large csv (tab delimited) data stored as azure blobs, and I want to create a pandas dataframe from these. I can do this locally as follows:
from azure.storage.blob import BlobService
import pandas as pd
import os.path
STORAGEACCOUNTNAME= 'account_name'
STORAGEACCOUNTKEY= "key"
LOCALFILENAME= 'path/to.csv'
CONTAINERNAME= 'container_name'
BLOBNAME= 'bloby_data/000000_0'
blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# Only get a local copy if haven't already got it
if not os.path.isfile(LOCALFILENAME):
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
df_customer = pd.read_csv(LOCALFILENAME, sep='\t')
However, when running the notebook on azure ML notebooks, I can't 'save a local copy' and then read from csv, and so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).
I can get to the desired end result (pandas dataframe for blob csv data), if I first create an azure ML workspace, and then read the datasets into that, and finally using https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas dataframe, but I'd prefer to just read straight from the blob storage location.
I think you want to use get_blob_to_bytes
, or get_blob_to_text
; these should output a string which you can use to create a dataframe as
from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
df = pd.read_csv(StringIO(blobstring))