S3Client

class omniduct.filesystems.s3.S3Client(cwd=None, home=None, read_only=False, global_writes=False, **kwargs)

Bases: omniduct.filesystems.base.FileSystemClient

This Duct connects to an Amazon S3 bucket instance using the boto3 library. Authentication is (optionally) handled using opinel.

Attributes:
  • bucket (str) – The name of the Amazon S3 bucket to use.
  • aws_profile (str) – The name of the configured AWS profile to use. This should refer to the name of a profile configured in, for example, ~/.aws/credentials. Authentication is handled by the opinel library, which is also aware of environment variables.
Attributes inherited from Duct:
  • protocol (str) – The name of the protocol for which this instance was created (especially useful if a Duct subclass supports multiple protocols).
  • name (str) – The name given to this Duct instance (defaults to class name).
  • host (str) – The host name providing the service (will be ‘127.0.0.1’, if service is port forwarded from remote; use ._host to see remote host).
  • port (int) – The port number of the service (will be the port-forwarded local port, if relevant; for remote port use ._port).
  • username (str, bool) – The username to use for the service.
  • password (str, bool) – The password to use for the service.
  • registry (None, omniduct.registry.DuctRegistry) – A reference to a DuctRegistry instance for runtime lookup of other services.
  • remote (None, omniduct.remotes.base.RemoteClient) – A reference to a RemoteClient instance to manage connections to remote services.
  • cache (None, omniduct.caches.base.Cache) – A reference to a Cache instance to add support for caching, if applicable.
  • connection_fields (tuple<str>, list<str>) – A list of instance attributes to monitor for changes, whereupon the Duct instance should automatically disconnect. By default, the following attributes are monitored: ‘host’, ‘port’, ‘remote’, ‘username’, and ‘password’.
  • prepared_fields (tuple<str>, list<str>) – A list of instance attributes to be populated (if their values are callable) when the instance first connects to a service. Refer to Duct.prepare and Duct._prepare for more details. By default, the following attributes are prepared: ‘_host’, ‘_port’, ‘_username’, and ‘_password’.

Additional attributes including host, port, username and password are documented inline.
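
The monitoring behaviour described for connection_fields can be pictured with a small stand-alone sketch. This is not omniduct's implementation: the MonitoredClient class and its connected flag are hypothetical names used only to illustrate the idea that changing a monitored attribute drops the current connection.

```python
class MonitoredClient:
    """Hypothetical stand-in for a Duct; illustrates the idea only."""

    # Attribute names whose changes should invalidate the connection.
    connection_fields = ("host", "port", "remote", "username", "password")

    def __init__(self, host, port):
        self.connected = False
        self.host = host
        self.port = port

    def __setattr__(self, name, value):
        # Only a real change to a monitored field counts; the first
        # assignment (attribute not yet set) is never treated as a change.
        changed = (
            name in self.connection_fields
            and getattr(self, name, value) != value
        )
        object.__setattr__(self, name, value)
        if changed and self.connected:
            # Stand-in for an automatic disconnect.
            object.__setattr__(self, "connected", False)


client = MonitoredClient("s3.example", 443)
client.connected = True
client.port = 443    # same value: the connection is kept
still_connected = client.connected
client.port = 8443   # new value: the connection is dropped
```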

Class Attributes:
  • AUTO_LOGGING_SCOPE (bool) – Whether this class should be used by omniduct logging code as a “scope”. Should be overridden by subclasses as appropriate.
  • DUCT_TYPE (Duct.Type) – The type of Duct service that is provided by this Duct instance. Should be overridden by subclasses as appropriate.
  • PROTOCOLS (list<str>) – The name(s) of any protocols that should be associated with this class. Should be overridden by subclasses as appropriate.
class Type

Bases: enum.Enum

The Duct.Type enum specifies all of the permissible values of Duct.DUCT_TYPE. Also determines the order in which ducts are loaded by DuctRegistry.

__init__(cwd=None, home=None, read_only=False, global_writes=False, **kwargs)
  • protocol (str, None) – Name of protocol (used by Duct registries to inform Duct instances of how they were instantiated).
  • name (str, None) – The name to be used by the Duct instance (defaults to class name if not specified).
  • registry (DuctRegistry, None) – The registry to use to look up remote and/or cache instances specified by name.
  • remote (str, RemoteClient) – The remote by which the ducted service should be contacted.
  • host (str) – The hostname of the service to be used by this client.
  • port (int) – The port of the service to be used by this client.
  • username (str, bool, None) – The username to authenticate with if necessary. If True, then users will be prompted at runtime for credentials.
  • password (str, bool, None) – The password to authenticate with if necessary. If True, then users will be prompted at runtime for credentials.
  • cache (Cache, None) – The cache client to be attached to this instance. The cache will only be used by specific methods as configured by the client.
  • cache_namespace (str, None) – The namespace to use by default when writing to the cache.
FileSystemClient Quirks:
  • cwd (None, str) – The path prefix to use as the current working directory (if None, the user’s home directory is used where that makes sense).
  • home (None, str) – The path prefix to use as the current user’s home directory. If not specified, it will default to an implementation-specific value (often ‘/’).
  • read_only (bool) – Whether the filesystem should only be able to perform read operations.
  • global_writes (bool) – Whether to allow writes outside of the user’s home folder.
  • **kwargs (dict) – Additional keyword arguments to be passed on to subclasses.

S3Client Quirks:

  • bucket (str) – The name of the Amazon S3 bucket to use.
  • aws_profile (str) – The name of the configured AWS profile to use. This should refer to the name of a profile configured in, for example, ~/.aws/credentials. Authentication is (optionally) handled by the opinel library, which is also aware of environment variables.
  • use_opinel (bool) – Use Opinel to extract AWS credentials. This is mainly useful if you have used opinel to set up MFA. Note: Opinel must be installed manually alongside omniduct to take advantage of this feature.
  • session (botocore.session.Session) – A pre-configured botocore Session instance to use instead of creating a new one when this client connects.
  • path_separator (str) – Amazon S3 is essentially a key-based storage system, and so one is free to choose an arbitrary “directory” separator. This defaults to ‘/’ for consistency with other filesystems.
  • skip_hadoop_artifacts (bool) – Whether to skip Hadoop artifacts like ‘*_$folder$’ when enumerating directories (default=True).

Note 1: aws_profile, if specified, should be the name of a profile as specified in ~/.aws/credentials. Authentication is handled by the opinel library, which is also aware of environment variables. Set up your command line aws client, and if it works, this should too.

Note 2: Some institutions have nuanced AWS configurations that are generated by scripts. It may be useful in these environments to subclass S3Client and override the _get_boto3_session method to suit your needs.
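
To make the effect of skip_hadoop_artifacts concrete, the following is a minimal stand-alone sketch of the kind of filtering the option describes. It assumes that the artifacts are keys ending in ‘_$folder$’; the helper name visible_keys is hypothetical and is not part of omniduct.

```python
def visible_keys(keys, skip_hadoop_artifacts=True):
    """Return the keys to show when enumerating a 'directory'.

    Assumption: Hadoop folder markers are keys ending in '_$folder$'.
    """
    if not skip_hadoop_artifacts:
        return list(keys)
    return [key for key in keys if not key.endswith("_$folder$")]


listing = ["data/part-0000.csv", "data_$folder$", "data/_SUCCESS"]
filtered = visible_keys(listing)
unfiltered = visible_keys(listing, skip_hadoop_artifacts=False)
```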

connect()

Connect to the service backing this client.

It is not normally necessary for a user to manually call this function, since when a connection is required, it is automatically created.

Returns: A reference to the current object.
Return type: Duct instance
dir(path=None)

Retrieve information about the children of a nominated directory.

This method returns a generator over FileSystemFileDesc objects that represent the files/directories that are present as children of the nominated path. If path is not a directory, an exception is raised. The path is interpreted as being relative to the current working directory (on remote filesystems, this will typically be the home folder).

Parameters: path (str) – The path to examine for children.
Returns: The children of path represented as FileSystemFileDesc objects.
Return type: generator<FileSystemFileDesc>
disconnect()

Disconnect this client from the backing service.

This method is automatically called during reconnections and/or at Python interpreter shutdown. It first calls Duct._disconnect (which should be implemented by subclasses) and then notifies the RemoteClient subclass, if present, to stop port-forwarding the remote service.

Returns: A reference to this object.
Return type: Duct instance
download(source, dest=None, overwrite=False, fs=None)

Download files to another filesystem.

This method (recursively) downloads a file/folder from path source on this filesystem to the path dest on filesystem fs, overwriting any existing file if overwrite is True.

Parameters:
  • source (str) – The path on this filesystem of the file to download to the nominated filesystem (fs). If source ends with ‘/’ then the contents of the source directory will be copied into the destination folder, and an error will be thrown if the path does not resolve to a directory.
  • dest (str) – The destination path on filesystem (fs). If not specified, the file/folder is downloaded into the default path, usually one’s home folder. If dest ends with ‘/’, and corresponds to a directory, the contents of source will be copied instead of copying the entire folder. If dest is otherwise a directory, an exception will be raised.
  • overwrite (bool) – True if the contents of any existing file by the same name should be overwritten, False otherwise.
  • fs (FileSystemClient) – The FileSystemClient into which the nominated file/folder source should be downloaded. If not specified, defaults to the local filesystem.
exists(path)

Check whether the nominated path exists on this filesystem.

Parameters: path (str) – The path for which to check existence.
Returns: True if a file/folder exists at the nominated path, and False otherwise.
Return type: bool
find(path_prefix=None, **attrs)

Find a file or directory based on certain attributes.

This method searches for files or folders which satisfy certain constraints on the attributes of the file (as encoded into FileSystemFileDesc). Note that without attribute constraints, this method will function identically to self.dir.

Parameters:
  • path_prefix (str) – The path under which files/directories should be found.
  • **attrs (dict) – Constraints on the fields of the FileSystemFileDesc objects associated with this filesystem, as constant values or callable objects (in which case the object will be called and should return True if the attribute value is a match, and False otherwise).
Returns: A generator over FileSystemFileDesc objects that are descendants of path_prefix and which satisfy the provided constraints.
Return type: generator<FileSystemFileDesc>
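
The constraint matching described for **attrs can be sketched in a few lines. This is an illustration under stated assumptions rather than omniduct's code: FileDesc is a hypothetical stand-in for FileSystemFileDesc with only ‘name’ and ‘type’ fields, and matches is a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical stand-in for FileSystemFileDesc with two illustrative fields.
FileDesc = namedtuple("FileDesc", ["name", "type"])


def matches(desc, **attrs):
    """Return True if desc satisfies every constraint in attrs.

    A constraint is either a constant (compared for equality) or a
    callable (called with the attribute value; truthy means a match).
    """
    for field, constraint in attrs.items():
        value = getattr(desc, field)
        if callable(constraint):
            if not constraint(value):
                return False
        elif value != constraint:
            return False
    return True


entries = [
    FileDesc("report.csv", "file"),
    FileDesc("archive", "directory"),
    FileDesc("notes.txt", "file"),
]
csv_files = [
    d.name
    for d in entries
    if matches(d, type="file", name=lambda n: n.endswith(".csv"))
]
# With no constraints every entry matches, mirroring the note that find
# without attribute constraints behaves like dir.
everything = [d.name for d in entries if matches(d)]
```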

classmethod for_protocol(protocol)

Retrieve a Duct subclass for a given protocol.

Parameters: protocol (str) – The protocol of interest.
Returns: The appropriate class for the provided protocol, partially constructed with the protocol keyword argument set appropriately.
Return type: functools.partial object
Raises: DuctProtocolUnknown – If no class has been defined that offers the named protocol.
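
The idea of returning a partially constructed class can be illustrated with a stand-alone sketch. The Client, ObjectStoreClient and ProtocolUnknown names below are hypothetical, and the lookup through __subclasses__ is an assumption made for brevity; omniduct's own lookup may work differently.

```python
import functools


class ProtocolUnknown(Exception):
    """Hypothetical stand-in for DuctProtocolUnknown."""


class Client:
    PROTOCOLS = []

    def __init__(self, protocol=None, **kwargs):
        self.protocol = protocol

    @classmethod
    def for_protocol(cls, protocol):
        # Assumption: only direct subclasses are searched in this sketch.
        for subclass in cls.__subclasses__():
            if protocol in subclass.PROTOCOLS:
                # Pre-bind the protocol so the instance knows how it was made.
                return functools.partial(subclass, protocol=protocol)
        raise ProtocolUnknown(protocol)


class ObjectStoreClient(Client):
    PROTOCOLS = ["s3"]


factory = Client.for_protocol("s3")
client = factory()
```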
global_writes

Whether writes should be permitted outside of home directory. This write-lock is designed to prevent inadvertent scripted writing in potentially dangerous places.

Type: bool
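
As a rough picture of how read_only, global_writes and the home directory interact, consider the sketch below. It assumes normalised absolute paths and a hypothetical helper named write_allowed; the actual checks performed by omniduct may differ in detail.

```python
def write_allowed(path, home, read_only=False, global_writes=False, sep="/"):
    """Decide whether a write to path should be permitted.

    Assumption: path and home are already normalised absolute paths.
    """
    if read_only:
        return False
    if global_writes:
        return True
    base = home.rstrip(sep)
    # Permit the home directory itself and anything beneath it.
    return path == base or path.startswith(base + sep)


inside = write_allowed("/home/alice/out.csv", "/home/alice")
outside = write_allowed("/etc/passwd", "/home/alice")
outside_global = write_allowed("/etc/passwd", "/home/alice", global_writes=True)
```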
host

The host name providing the service, or ‘127.0.0.1’ if self.remote is not None, whereupon the service will be port-forwarded locally. You can view the remote hostname using duct._host, and change the remote host at runtime using: duct.host = ‘<host>’.

Type: str
is_connected()

Check whether this Duct instance is currently connected.

This method checks to see whether a Duct instance is currently connected. This is performed by verifying that the remote host and port are still accessible, and then by calling Duct._is_connected, which should be implemented by subclasses.

Returns: Whether this Duct instance is currently connected.
Return type: bool
isdir(path)

Check whether a nominated path is a directory.

Parameters: path (str) – The path for which to check directory nature.
Returns: True if a folder exists at the nominated path, and False otherwise.
Return type: bool
isfile(path)

Check whether a nominated path is a file.

Parameters: path (str) – The path for which to check file nature.
Returns: True if a file exists at the nominated path, and False otherwise.
Return type: bool
listdir(path=None)

Retrieve the names of the children of a nominated directory.

This method inspects the contents of a directory using .dir(path), and returns the names of child members as strings. path is interpreted relative to the current working directory (on remote filesystems, this will typically be the home folder).

Parameters: path (str) – The path of the directory from which to enumerate filenames.
Returns: The names of all children of the nominated directory.
Return type: list<str>
mkdir(path, recursive=True, exist_ok=False)

Create a directory at the given path.

Parameters:
  • path (str) – The path of the directory to create.
  • recursive (bool) – Whether to recursively create any parents of this path if they do not already exist.

Note: exist_ok is passed on to subclass implementations of _mkdir rather than implementing the existence check using .exists, so that they can avoid the overhead associated with multiple operations, which can be costly in some cases.

open(path, mode='rt')

Open a file for reading and/or writing.

This method opens the file at the given path for reading and/or writing operations. The object returned is programmatically interchangeable with any other Python file-like object, including specification of file modes. If the file is opened in write mode, changes will only be flushed to the source filesystem when the file is closed.

Parameters:
  • path (str) – The path of the file to open.
  • mode (str) – All standard Python file modes.
Returns: An opened file-like object.
Return type: FileSystemFile or file-like
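
The note that writes are only flushed when the file is closed can be illustrated with an in-memory sketch. BufferedWriteFile and the store dictionary (standing in for the remote filesystem) are hypothetical; this is not the FileSystemFile class itself, and it handles binary data only.

```python
import io


class BufferedWriteFile(io.BytesIO):
    """Hypothetical file-like object that publishes its contents on close."""

    def __init__(self, store, key):
        super().__init__()
        self._store = store  # dict standing in for the remote filesystem
        self._key = key

    def close(self):
        # Publish the buffered bytes exactly once, just before closing.
        if not self.closed:
            self._store[self._key] = self.getvalue()
        super().close()


store = {}
with BufferedWriteFile(store, "notes.txt") as handle:
    handle.write(b"hello")
    visible_before_close = "notes.txt" in store
```

Until the with block exits, nothing has been written to store; a caller that needs the data to be visible must close the file first.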

password

Some services require authentication in order to connect to the service, in which case the appropriate password can be specified. If True was provided at instantiation, you will be prompted to type your password at runtime when necessary. If False was provided, then None will be returned. You can specify a different password at runtime using: duct.password = ‘<password>’.

Type: str
path_basename(path)

Extract the last component of a given path.

Components are determined by splitting by self.path_separator. Note that if a path ends with a path separator, the basename will be the empty string.

Parameters: path (str) – The path from which the basename should be extracted.
Returns: The extracted basename.
Return type: str
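
The splitting rule above, including the trailing-separator case, can be sketched as follows. The stand-alone function takes the separator as an argument (an assumption made so that the example runs without a client instance) and is not omniduct's implementation.

```python
def path_basename(path, sep="/"):
    """Return the last component of path after splitting on sep."""
    return path.split(sep)[-1]


name = path_basename("logs/2020/app.log")
trailing = path_basename("logs/2020/")  # ends with the separator
```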
path_cwd

The path prefix associated with the current working directory. If not otherwise set, it will be the user’s home directory, and will be the prefix used by all non-absolute path references on this filesystem.

Type: str
path_dirname(path)

Extract the parent directory for provided path.

This method returns the entire path except for the basename (the last component), where components are determined by splitting by self.path_separator.

Parameters: path (str) – The path from which the directory path should be extracted.
Returns: The extracted directory path.
Return type: str
path_home

The path prefix to use as the current user’s home directory. Unless cwd is set, this will be the prefix to use for all non-absolute path references on this filesystem. This is assumed not to change between connections, and so will not be updated on client reconnections. Unless global_writes is set to True, this will be the only folder into which this client is permitted to write.

Type: str
path_join(path, *components)

Generate a new path by joining together multiple paths.

If any component starts with self.path_separator or ‘~’, then all previous path components are discarded, and the effective base path becomes that component (with ‘~’ expanding to self.path_home). Note that this method does not simplify path components like ‘..’. Use self.path_normpath for this purpose.

Parameters:
  • path (str) – The base path to which components should be joined.
  • *components (str) – Any additional components to join to the base path.
Returns: The path resulting from joining all of the components nominated, in order, to the base path.
Return type: str
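
The joining rules described above can be approximated by the stand-alone sketch below. The sep and home arguments are assumptions standing in for self.path_separator and self.path_home, only a leading ‘~’ is expanded, and the real method may treat edge cases differently.

```python
def path_join(path, *components, sep="/", home="/home/user"):
    """Join components onto path following the documented rules (sketch)."""
    result = path
    for component in components:
        if component.startswith(sep) or component.startswith("~"):
            # An absolute or home-relative component discards what came before.
            result = component
        elif not result or result.endswith(sep):
            result += component
        else:
            result += sep + component
    if result.startswith("~"):
        # Assumption: a leading '~' expands to the home prefix.
        result = home + result[1:]
    return result


plain = path_join("/data", "raw", "2020")
reset = path_join("/data", "/tmp", "scratch")
homed = path_join("/data", "~/projects")
unsimplified = path_join("/data", "../other")
```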

path_normpath(path)

Normalise a pathname.

This method returns the normalised (absolute) path corresponding to path on this filesystem.

Parameters: path (str) – The path to normalise (make absolute).
Returns: The normalised path.
Return type: str
path_separator

The character(s) to use in separating path components. Typically this will be ‘/’.

Type: str
port

The local port for the service. If self.remote is not None, the port will be port-forwarded from the remote host. To see the port used on the remote host refer to duct._port. You can change the remote port at runtime using: duct.port = <port>.

Type: int
prepare()

Prepare a Duct subclass for use (if not already prepared).

This method is called before the value of any of the fields referenced in self.connection_fields is retrieved. The fields include, by default: ‘host’, ‘port’, ‘remote’, ‘cache’, ‘username’, and ‘password’. Subclasses may add or subtract from these special fields.

When called, it first checks whether the instance has already been prepared, and if not calls _prepare and then records that the instance has been successfully prepared.

S3Client Quirks:

This method may be overridden by subclasses, but provides the following default behaviour:

  • Ensures self.registry, self.remote and self.cache values are instances of the right types.
  • It replaces string values of self.remote and self.cache with remotes and caches looked up using self.registry.lookup.
  • It looks through each of the fields nominated in self.prepared_fields and, if the corresponding value is callable, sets the value of that field to the result of calling that value with a reference to self. By default, prepared_fields contains ‘_host’, ‘_port’, ‘_username’, and ‘_password’.
  • Ensures value of self.port is an integer (or None).
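
The handling of callable values in prepared_fields, together with the integer coercion of the port, can be sketched as below. PreparedClient and its _prepared flag are hypothetical; registry lookups of remotes and caches are omitted from this illustration.

```python
class PreparedClient:
    """Hypothetical stand-in showing lazy resolution of prepared fields."""

    prepared_fields = ("_host", "_port")

    def __init__(self, host, port):
        self._host = host
        self._port = port
        self._prepared = False

    def prepare(self):
        if self._prepared:
            return  # already prepared; nothing to do
        for field in self.prepared_fields:
            value = getattr(self, field)
            if callable(value):
                # Callables receive a reference to the instance itself.
                setattr(self, field, value(self))
        if self._port is not None:
            self._port = int(self._port)
        self._prepared = True


calls = []


def resolve_host(duct):
    calls.append(duct)
    return "s3.internal"


client = PreparedClient(host=resolve_host, port="443")
client.prepare()
client.prepare()  # the second call does not re-run the callable
```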
read_only

Whether this filesystem client should be permitted to attempt any write operations.

Type: bool
reconnect()

Disconnects, and then reconnects, this client.

Note: This is equivalent to duct.disconnect().connect().

Returns: A reference to this object.
Return type: Duct instance
remove(path, recursive=False)

Remove file(s) at a nominated path.

Directories (and their contents) will not be removed unless recursive is set to True.

Parameters:
  • path (str) – The path of the file/directory to be removed.
  • recursive (bool) – Whether to remove directories and all of their contents.
reset()

Reset this Duct instance to its pre-preparation state.

This method disconnects from the service, resets any temporary authentication and restores the values of the attributes listed in prepared_fields to their values as of when Duct.prepare was called.

Returns: A reference to this object.
Return type: Duct instance
showdir(path=None)

Return a dataframe representation of a directory.

This method returns a pandas.DataFrame representation of the contents of a path, which are retrieved using .dir(path). The exact columns will vary from filesystem to filesystem, depending on the fields returned by .dir(), but the returned DataFrame is guaranteed to at least have the columns: ‘name’ and ‘type’.

Parameters: path (str) – The path of the directory from which to show contents.
Returns: A DataFrame representation of the contents of the nominated directory.
Return type: pandas.DataFrame
upload(source, dest=None, overwrite=False, fs=None)

Upload files from another filesystem.

This method (recursively) uploads a file/folder from path source on filesystem fs to the path dest on this filesystem, overwriting any existing file if overwrite is True. This is equivalent to fs.download(…, fs=self).

Parameters:
  • source (str) – The path on the specified filesystem (fs) of the file to upload to this filesystem. If source ends with ‘/’, and corresponds to a directory, the contents of source will be copied instead of copying the entire folder.
  • dest (str) – The destination path on this filesystem. If not specified, the file/folder is uploaded into the default path, usually one’s home folder, on this filesystem. If dest ends with ‘/’ then the file will be copied into the destination folder, and an error will be thrown if the path does not resolve to a directory.
  • overwrite (bool) – True if the contents of any existing file by the same name should be overwritten, False otherwise.
  • fs (FileSystemClient) – The FileSystemClient from which to load the file/folder at source. If not specified, defaults to the local filesystem.
username

Some services require authentication in order to connect to the service, in which case the appropriate username can be specified. If not specified at instantiation, your local login name will be used. If True was provided, you will be prompted to type your username at runtime as necessary. If False was provided, then None will be returned. You can specify a different username at runtime using: duct.username = ‘<username>’.

Type: str
walk(path=None)

Explore the filesystem tree starting at a nominated path.

This method returns a generator which recursively walks over all paths that are children of path, one result for each directory, of form: (<path name>, [<directory 1>, …], [<file 1>, …])

Parameters: path (str) – The path of the directory from which to enumerate contents.
Returns: A generator of tuples, each tuple being associated with one directory that is either path or one of its descendants.
Return type: generator<tuple>
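
The shape of the tuples produced by walk can be seen in the following stand-alone sketch, which walks a nested dictionary instead of a real filesystem. The tree layout (dictionaries for directories, None for files) and the sorted ordering are assumptions made for the illustration.

```python
def walk(tree, path=""):
    """Yield (path, [directories], [files]) for tree and its descendants.

    Assumption: directories are dicts and files map to None.
    """
    dirs = sorted(n for n, child in tree.items() if isinstance(child, dict))
    files = sorted(n for n, child in tree.items() if not isinstance(child, dict))
    yield (path, dirs, files)
    for name in dirs:
        child_path = f"{path}/{name}" if path else name
        yield from walk(tree[name], child_path)


tree = {
    "data": {"2020": {"a.csv": None}, "readme.txt": None},
    "top.txt": None,
}
results = list(walk(tree, "bucket"))
```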