Reading & Writing Files from/to Storj with Pandas

Using s3fs-supported pandas API

Intro

Pandas is one of the most widely used Python libraries for data analysis and data science. Storj DCS lets you store datasets with a high level of durability, privacy, and security.

In this post, we will look at how to use pandas to save data to Storj DCS and load it back.

Content

Requirements
Configuring pandas
Saving data to Storj DCS
Loading data from Storj DCS
Conclusion

Requirements

We rely on a pandas feature called storage_options. This extra parameter lets us pass connection settings to a specific storage backend. It was introduced in version 1.2.0; see the pandas documentation on storage_options for further details.

From pandas 0.20.1 documentation:

“pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.”

s3fs_required.

Installing pandas Version 1.2.0

pip3 install pandas==1.2.0

Installing s3fs

pip3 install s3fs
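After installing, it can be useful to confirm that the environment meets the requirements above. The following is a minimal sketch using only the standard library; the helper names parse_version and supports_storage_options are hypothetical and not part of pandas or s3fs.

```python
import re
from importlib import metadata, util

# storage_options was introduced in pandas 1.2.0 (see above).
MIN_PANDAS = (1, 2, 0)


def parse_version(version):
    """Turn a string such as '1.2.0' or '2.1.0rc1' into a tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_storage_options(version):
    """Return True if this pandas version is at least 1.2.0."""
    return parse_version(version) >= MIN_PANDAS


if __name__ == "__main__":
    try:
        installed = metadata.version("pandas")
    except metadata.PackageNotFoundError:
        print("pandas is not installed")
    else:
        print(f"pandas {installed}, storage_options supported:",
              supports_storage_options(installed))
    # find_spec checks for the package without importing it.
    print("s3fs installed:", util.find_spec("s3fs") is not None)
```

The check reads installed package metadata rather than importing pandas, so it runs quickly and also works when pandas is missing.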

Configuring pandas

If you already have a Storj DCS account, you just need to get your keys and endpoint URL.

We are going to load the credentials from environment variables. You should have these three variables available: ACCESS_KEY_ID, SECRET_ACCESS_KEY, and ENDPOINT_URL.

This configuration works for all pandas methods that accept custom storage options, such as read_csv, read_excel, and read_table.

import os

# loading environment variables
ACCESS_KEY_ID = os.getenv("ACCESS_KEY_ID")
SECRET_ACCESS_KEY = os.getenv("SECRET_ACCESS_KEY")
ENDPOINT_URL = os.getenv("ENDPOINT_URL")

We need to override client_kwargs and set endpoint_url. In this case, the address must be the gateway URL. Example: https://gateway.us1.storjshare.io

storage_options = {
  'key': ACCESS_KEY_ID,
  'secret': SECRET_ACCESS_KEY,
  'client_kwargs': {
    'endpoint_url': ENDPOINT_URL
  }
}
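Note that os.getenv returns None when a variable is not set, so a missing credential only surfaces later as a connection error. As a sketch, the configuration above could be wrapped in a small helper that fails early. The function name build_storage_options and the REQUIRED_VARS tuple are hypothetical; the dictionary layout is the one shown above.

```python
import os

# The three variables this post expects to find in the environment.
REQUIRED_VARS = ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "ENDPOINT_URL")


def build_storage_options(env=os.environ):
    """Build the storage_options dict, raising if any variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {
        "key": env["ACCESS_KEY_ID"],
        "secret": env["SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": env["ENDPOINT_URL"]},
    }
```

Calling build_storage_options() with no arguments reads from the process environment; passing a plain dict instead makes the helper easy to exercise without touching real credentials.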

Saving Data to Storj DCS

In this blog post, we are going to save and load our pandas DataFrame in CSV format. Other formats work too, through any of the methods that accept storage_options, as mentioned in the previous section.

import numpy as np
import pandas as pd

bucket = "mybucket"
key = "random.csv"

# Creating a random dataframe.
df = pd.DataFrame(np.random.uniform(0,1,[10**3,3]), columns=list('ABC'))

# Saving as CSV
df.to_csv(
  f"s3://{bucket}/{key}",
  index=False,
  storage_options=storage_options)

Loading Data from Storj DCS

Loading works the same way: pass storage_options as a parameter.

new_df = pd.read_csv(
  f"s3://{bucket}/{key}",
  storage_options=storage_options)
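To confirm that what was loaded matches what was saved, the two steps can be combined into a round-trip check. The sketch below assumes a hypothetical helper named roundtrip_csv; it is demonstrated against a local temporary file so it runs without credentials, but the same function should accept an s3:// path together with the storage_options dictionary built earlier.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def roundtrip_csv(df, path, storage_options=None):
    """Write df as CSV to path, read it back, and check that both frames match."""
    df.to_csv(path, index=False, storage_options=storage_options)
    loaded = pd.read_csv(path, storage_options=storage_options)
    # The default comparison tolerates tiny floating-point differences
    # introduced by the text round trip.
    pd.testing.assert_frame_equal(df, loaded)
    return loaded


if __name__ == "__main__":
    df = pd.DataFrame(np.random.uniform(0, 1, [10**3, 3]), columns=list("ABC"))
    # Local demonstration; for Storj DCS, pass f"s3://{bucket}/{key}"
    # and storage_options=storage_options instead.
    with tempfile.TemporaryDirectory() as tmp:
        loaded = roundtrip_csv(df, os.path.join(tmp, "random.csv"))
        print(loaded.shape)
```

assert_frame_equal raises an AssertionError if the frames differ, so a silent return means the data survived the trip intact.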

Conclusion

Using pandas with Storj DCS is easy and requires only a few lines of configuration.

If you already use pandas with S3, migrating to Storj DCS is straightforward.