PrestoClient

class omniduct.databases.presto.PrestoClient(session_properties=None, templates=None, template_context=None, default_format_opts=None, **kwargs)[source]

Bases: omniduct.databases.base.DatabaseClient, omniduct.databases._schemas.SchemasMixin

This Duct connects to a Facebook Presto server instance using the pyhive library.

In addition to the standard DatabaseClient API, PrestoClient adds a .schemas descriptor attribute, which enables a tab completion driven exploration of a Presto database’s schemas and tables.

Attributes:
  • catalog (str) – The default catalog to use in database queries.
  • schema (str) – The default schema/database to use in database queries.
  • connection_options (dict) – Additional options to pass on to pyhive.presto.connect(…).
Attributes inherited from Duct:
protocol (str): The name of the protocol for which this instance was
created (especially useful if a Duct subclass supports multiple protocols).
name (str): The name given to this Duct instance (defaults to class
name).
host (str): The host name providing the service (will be ‘127.0.0.1’, if
service is port forwarded from remote; use ._host to see remote host).
port (int): The port number of the service (will be the port-forwarded
local port, if relevant; for remote port use ._port).

username (str, bool): The username to use for the service. password (str, bool): The password to use for the service. registry (None, omniduct.registry.DuctRegistry): A reference to a

DuctRegistry instance for runtime lookup of other services.
remote (None, omniduct.remotes.base.RemoteClient): A reference to a
RemoteClient instance to manage connections to remote services.
cache (None, omniduct.caches.base.Cache): A reference to a Cache
instance to add support for caching, if applicable.
connection_fields (tuple<str>, list<str>): A list of instance attributes
to monitor for changes, whereupon the Duct instance should automatically disconnect. By default, the following attributes are monitored: ‘host’, ‘port’, ‘remote’, ‘username’, and ‘password’.
prepared_fields (tuple<str>, list<str>): A list of instance attributes to
be populated (if their values are callable) when the instance first connects to a service. Refer to Duct.prepare and Duct._prepare for more details. By default, the following attributes are prepared: ‘_host’, ‘_port’, ‘_username’, and ‘_password’.

Additional attributes including host, port, username and password are documented inline.

Class Attributes:
AUTO_LOGGING_SCOPE (bool): Whether this class should be used by omniduct
logging code as a “scope”. Should be overridden by subclasses as appropriate.
DUCT_TYPE (Duct.Type): The type of Duct service that is provided by
this Duct instance. Should be overridden by subclasses as appropriate.
PROTOCOLS (list<str>): The name(s) of any protocols that should be
associated with this class. Should be overridden by subclasses as appropriate.
class Type

Bases: enum.Enum

The Duct.Type enum specifies all of the permissible values of Duct.DUCT_TYPE. Also determines the order in which ducts are loaded by DuctRegistry.

__init__(session_properties=None, templates=None, template_context=None, default_format_opts=None, **kwargs)
protocol (str, None): Name of protocol (used by Duct registries to inform
Duct instances of how they were instantiated).
name (str, None): The name to used by the Duct instance (defaults to
class name if not specified).
registry (DuctRegistry, None): The registry to use to lookup remote
and/or cache instance specified by name.
remote (str, RemoteClient): The remote by which the ducted service
should be contacted.

host (str): The hostname of the service to be used by this client. port (int): The port of the service to be used by this client. username (str, bool, None): The username to authenticate with if necessary.

If True, then users will be prompted at runtime for credentials.
password (str, bool, None): The password to authenticate with if necessary.
If True, then users will be prompted at runtime for credentials.
cache(Cache, None): The cache client to be attached to this instance.
Cache will only used by specific methods as configured by the client.
cache_namespace(str, None): The namespace to use by default when writing
to the cache.
DatabaseClient Quirks:
session_properties (dict): A mapping of default session properties
to values. Interpretation is left up to implementations.
templates (dict): A dictionary of name to template mappings. Additional
templates can be added using .template_add.
template_context (dict): The default template context to use when
rendering templates.
default_format_opts (dict): The default formatting options passed to
cursor formatter.
PrestoClient Quirks:

catalog (str): The default catalog to use in database queries. schema (str): The default schema/database to use in database queries. server_protocol (str): The protocol over which to connect to the Presto REST

service (‘http’ or ‘https’). (default=’http’)
source (str): The source of this query (by default “omniduct <version>”).
If manually specified, result will be: “<source> / omniduct <version>”.
connect()

Connect to the service backing this client.

It is not normally necessary for a user to manually call this function, since when a connection is required, it is automatically created.

Returns:A reference to the current object.
Return type:Duct instance
dataframe_to_table(df, table, if_exists='fail', **kwargs)

Upload a local pandas dataframe into a table in this database.

Parameters:
  • df (pandas.DataFrame) – The dataframe to upload into the database.
  • table (str, ParsedNamespaces) – The name of the table into which the dataframe should be uploaded.
  • if_exists (str) – if nominated table already exists: ‘fail’ to do nothing, ‘replace’ to drop, recreate and insert data into new table, and ‘append’ to add data from this table into the existing table.
  • **kwargs (dict) – Additional keyword arguments to pass onto DatabaseClient._dataframe_to_table.
PrestoClient Quirks:
If if the schema namespace is not specified, table.schema will be defaulted to your username. Catalog overrides will be ignored, and will default to self.catalog.
disconnect()

Disconnect this client from backing service.

This method is automatically called during reconnections and/or at Python interpreter shutdown. It first calls Duct._disconnect (which should be implemented by subclasses) and then notifies the RemoteClient subclass, if present, to stop port-forwarding the remote service.

Returns:A reference to this object.
Return type:Duct instance
execute(statement, wait=True, cursor=None, session_properties=None, **kwargs)

Execute a statement against this database and return a cursor object.

Where supported by database implementations, this cursor can the be used in future executions, by passing it as the cursor keyword argument.

Parameters:
  • statement (str) – The statement to be executed by the query client (possibly templated).
  • wait (bool) – Whether the cursor should be returned before the server-side query computation is complete and the relevant results downloaded.
  • cursor (DBAPI2 cursor) – Rather than creating a new cursor, execute the statement against the provided cursor.
  • session_properties (dict) – Additional session properties and/or overrides to use for this query. Setting a session property value to None will cause it to be omitted.
  • **kwargs (dict) – Extra keyword arguments to be passed on to _execute, as implemented by subclasses.
  • template (bool) – Whether the statement should be treated as a Jinja2 template. [Used by render_statement decorator.]
  • context (dict) – The context in which the template should be evaluated (a dictionary of parameters to values). [Used by render_statement decorator.]
  • use_cache (bool) – True or False (default). Whether to use the cache (if present). [Used by cached_method decorator.]
  • renew (bool) – True or False (default). If cache is being used, renew it before returning stored value. [Used by cached_method decorator.]
  • cleanup (bool) – Whether statement should be cleaned up before computing the hash used to cache results. [Used by cached_method decorator.]
Returns:

A DBAPI2 compatible cursor instance.

Return type:

DBAPI2 cursor

PrestoClient Quirks:
If something goes wrong, PrestoClient will attempt to parse the error log and present the user with useful debugging information. If that fails, the full traceback will be raised instead.
execute_from_file(file, fs=None, **kwargs)

Execute a statement stored in a file.

Parameters:
  • file (str, file-like-object) – The path of the file containing the query statement to be executed against the database, or an open file-like resource.
  • fs (None, FileSystemClient) – The filesystem wihin which the nominated file should be found. If None, the local filesystem will be used.
  • **kwargs (dict) – Extra keyword arguments to pass on to DatabaseClient.execute.
Returns:

A DBAPI2 compatible cursor instance.

Return type:

DBAPI2 cursor

execute_from_template(name, context=None, **kwargs)

Render and then execute a named template.

Parameters:
  • name (str) – The name of the template to be rendered and executed.
  • context (dict) – The context in which the template should be rendered.
  • **kwargs (dict) – Additional parameters to pass to .execute().
Returns:

A DBAPI2 compatible cursor instance.

Return type:

DBAPI2 cursor

classmethod for_protocol(protocol)

Retrieve a Duct subclass for a given protocol.

Parameters:protocol (str) – The protocol of interest.
Returns:
The appropriate class for the provided,
partially constructed with the protocol keyword argument set appropriately.
Return type:functools.partial object
Raises:DuctProtocolUnknown – If no class has been defined that offers the named protocol.
host

The host name providing the service, or ‘127.0.0.1’ if self.remote is not None, whereupon the service will be port-forwarded locally. You can view the remote hostname using duct._host, and change the remote host at runtime using: duct.host = ‘<host>’.

Type:str
is_connected()

Check whether this Duct instances is currently connected.

This method checks to see whether a Duct instance is currently connected. This is performed by verifying that the remote host and port are still accessible, and then by calling Duct._is_connected, which should be implemented by subclasses.

Returns:Whether this Duct instance is currently connected.
Return type:bool
password

Some services require authentication in order to connect to the service, in which case the appropriate password can be specified. If True was provided at instantiation, you will be prompted to type your password at runtime when necessary. If False was provided, then None will be returned. You can specify a different password at runtime using: duct.password = ‘<password>’.

Type:str
port

The local port for the service. If self.remote is not None, the port will be port-forwarded from the remote host. To see the port used on the remote host refer to duct._port. You can change the remote port at runtime using: duct.port = <port>.

Type:int
prepare()

Prepare a Duct subclass for use (if not already prepared).

This method is called before the value of any of the fields referenced in self.connection_fields are retrieved. The fields include, by default: ‘host’, ‘port’, ‘remote’, ‘cache’, ‘username’, and ‘password’. Subclasses may add or subtract from these special fields.

When called, it first checks whether the instance has already been prepared, and if not calls _prepare and then records that the instance has been successfully prepared.

PrestoClient Quirks:

This method may be overridden by subclasses, but provides the following default behaviour:

  • Ensures self.registry, self.remote and self.cache values are instances of the right types.
  • It replaces string values of self.remote and self.cache with remotes and caches looked up using self.registry.lookup.
  • It looks through each of the fields nominated in self.prepared_fields and, if the corresponding value is callable, sets the value of that field to result of calling that value with a reference to self. By default, prepared_fields contains ‘_host’, ‘_port’, ‘_username’, and ‘_password’.
  • Ensures value of self.port is an integer (or None).
query(statement, format=None, format_opts={}, use_cache=True, **kwargs)

Execute a statement against this database and collect formatted data.

Parameters:
  • statement (str) – The statement to be executed by the query client (possibly templated).
  • format (str) – A subclass of CursorFormatter, or one of: ‘pandas’, ‘hive’, ‘csv’, ‘tuple’ or ‘dict’. Defaults to self.DEFAULT_CURSOR_FORMATTER.
  • format_opts (dict) – A dictionary of format-specific options.
  • use_cache (bool) – Whether to cache the cursor returned by DatabaseClient.execute() (overrides the default of False for .execute()). (default=True)
  • **kwargs (dict) – Additional arguments to pass on to DatabaseClient.execute().
Returns:

The results of the query formatted as nominated.

query_from_file(file, fs=None, **kwargs)

Query using a statement stored in a file.

Parameters:
  • file (str, file-like-object) – The path of the file containing the query statement to be executed against the database, or an open file-like resource.
  • fs (None, FileSystemClient) – The filesystem wihin which the nominated file should be found. If None, the local filesystem will be used.
  • **kwargs (dict) – Extra keyword arguments to pass on to DatabaseClient.query.
Returns:

The results of the query formatted as nominated.

Return type:

object

query_from_template(name, context=None, **kwargs)

Render and then query using a named tempalte.

Parameters:
  • name (str) – The name of the template to be rendered and used to query the database.
  • context (dict) – The context in which the template should be rendered.
  • **kwargs (dict) – Additional parameters to pass to .query().
Returns:

The results of the query formatted as nominated.

Return type:

object

query_to_table(statement, table, if_exists='fail', **kwargs)

Run a query and store the results in a table in this database.

Parameters:
  • statement – The statement to be executed.
  • table (str) – The name of the table into which the dataframe should be uploaded.
  • if_exists (str) – if nominated table already exists: ‘fail’ to do nothing, ‘replace’ to drop, recreate and insert data into new table, and ‘append’ to add data from this table into the existing table.
  • **kwargs (dict) – Additional keyword arguments to pass onto DatabaseClient._query_to_table.
Returns:

The cursor object associated with the execution.

Return type:

DB-API cursor

reconnect()

Disconnects, and then reconnects, this client.

Note: This is equivalent to duct.disconnect().connect().

Returns:A reference to this object.
Return type:Duct instance
register_magics(base_name=None)

The following magic functions will be registered (assuming that the base name is chosen to be ‘hive’): - Cell Magics:

  • %%hive: For querying the database.
  • %%hive.execute: For executing a statement against the database.
  • %%hive.stream: For executing a statement against the database,
    and streaming the results.
  • %%hive.template: The defining a new template.
  • %%hive.render: Render a provided query statement.
  • Line Magics:
    • %hive: For querying the database using a named template.
    • %hive.execute: For executing a named template statement against
      the database.
    • %hive.stream: For executing a named template against the database,
      and streaming the results.
    • %hive.render: Render a provided a named template.
    • %hive.desc: Describe the table nominated.
    • %hive.head: Return the first rows in a specified table.
    • %hive.props: Show the properties specified for a nominated table.

Documentation for these magics is provided online.

reset()

Reset this Duct instance to its pre-preparation state.

This method disconnects from the service, resets any temporary authentication and restores the values of the attributes listed in prepared_fields to their values as of when Duct.prepare was called.

Returns:A reference to this object.
Return type:Duct instance
schemas

An object with attributes corresponding to the names of the schemas in this database.

Type:object
session_properties

The default session properties used in statement executions.

Type:dict
classmethod statement_cleanup(statement)

Clean up statements prior to hash computation.

This classmethod takes an SQL statement and reformats it by consistently removing comments and replacing all whitespace. It is used by the statement_hash method to avoid functionally identical queries hitting different cache keys. If the statement’s language is not to be SQL, this method should be overloaded appropriately.

Parameters:statement (str) – The statement to be reformatted/cleaned-up.
Returns:The new statement, consistently reformatted.
Return type:str
classmethod statement_hash(statement, cleanup=True)

Retrieve the hash to use to identify query statements to the cache.

Parameters:
  • statement (str) – A string representation of the statement to be hashed.
  • cleanup (bool) – Whether the statement should first be consistently reformatted using statement_cleanup.
Returns:

The hash used to identify a statement to the cache.

Return type:

str

stream(statement, format=None, format_opts={}, batch=None, **kwargs)

Execute a statement against this database and stream formatted results.

This method returns a generator over objects representing rows in the result set. If batch is not None, then the iterator will be over lists of length batch containing formatted rows.

Parameters:
  • statement (str) – The statement to be executed against the database.
  • format (str) – A subclass of CursorFormatter, or one of: ‘pandas’, ‘hive’, ‘csv’, ‘tuple’ or ‘dict’. Defaults to self.DEFAULT_CURSOR_FORMATTER.
  • format_opts (dict) – A dictionary of format-specific options.
  • batch (int) – If not None, the number of rows from the resulting cursor to be returned at once.
  • **kwargs (dict) – Additional keyword arguments to pass onto DatabaseClient.execute.
Returns:

An iterator over objects of the nominated format or, if

batched, a list of such objects.

Return type:

iterator

stream_to_file(statement, file, format='csv', fs=None, **kwargs)

Execute a statement against this database and stream results to a file.

This method is a wrapper around DatabaseClient.stream that enables the iterative writing of cursor results to a file. This is especially useful when there are a very large number of results, and loading them all into memory would require considerable resources. Note that ‘csv’ is the default format for this method (rather than pandas).

Parameters:
  • statement (str) – The statement to be executed against the database.
  • file (str, file-like-object) – The filename where the data should be written, or an open file-like resource.
  • format (str) – The format to be used (‘csv’ by default). Format options can be passed via **kwargs.
  • fs (None, FileSystemClient) – The filesystem wihin which the nominated file should be found. If None, the local filesystem will be used.
  • **kwargs – Additional keyword arguments to pass onto DatabaseClient.stream.
table_desc(table, renew=True, **kwargs)

Describe a table in the database.

Parameters:
  • table (str) – The table to describe.
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A dataframe description of the table.

Return type:

pandas.DataFrame

table_drop(table, **kwargs)

Remove a table from the database.

Parameters:
  • table (str) – The table to drop.
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

The cursor associated with this execution.

Return type:

DB-API cursor

table_exists(table, renew=True, **kwargs)

Check whether a table exists.

Parameters:
  • table (str) – The table for which to check.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

True if table exists, and False otherwise.

Return type:

bool

table_head(table, n=10, renew=True, **kwargs)

Retrieve the first n rows from a table.

Parameters:
  • table (str) – The table from which to extract data.
  • n (int) – The number of rows to extract.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A dataframe representation of the first n rows

of the nominated table.

Return type:

pandas.DataFrame

table_list(namespace=None, renew=True, **kwargs)

Return a list of table names in the data source as a DataFrame.

Parameters:
  • namespace (str) – The namespace in which to look for tables.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

The names of schemas in this database.

Return type:

list<str>

table_props(table, renew=True, **kwargs)

Retrieve the properties associated with a table.

Parameters:
  • table (str) – The table from which to extract data.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A dataframe representation of the table

properties.

Return type:

pandas.DataFrame

template_add(name, body)

Add a named template to the internal dictionary of templates.

Note: Templates are interpreted as jinja2 templates. See .template_render for more information.

Parameters:
  • name (str) – The name of the template.
  • body (str) – The (typically) multiline body of the template.
Returns:

A reference to this object.

Return type:

PrestoClient

template_get(name)

Retrieve a named template.

Parameters:name (str) – The name of the template to retrieve.
Raises:ValueError – If name is not associated with a template.
Returns:The requested template.
Return type:str
template_names

A list of names associated with the templates associated with this client.

Type:list
template_render(name_or_statement, context=None, by_name=False, cleanup=False, meta_only=False)

Render a template by name or value.

In addition to the jinja2 templating syntax, described in more detail in the official jinja2 documentation, a meta-templating extension is also provided. This meta-templating allows you to reference other reference other templates. For example, if you had two SQL templates named ‘template_a’ and ‘template_b’, then you could render them into one SQL query using (for example):

.template_render('''
WITH
    a AS (
        {{{template_a}}}
    ),
    b AS (
        {{{template_b}}}
    )
SELECT *
FROM a
JOIN b ON a.x = b.x
''')

Note that template substitution in this way is iterative, so you can chain template embedding, provided that such embedding is not recursive.

Parameters:
  • name_or_statement (str) – The name of a template (if by_name is True) or else a string representation of a jinja2 template.
  • context (dict, None) – A dictionary to use as the template context. If not specified, an empty dictionary is used.
  • by_name (bool) – True if name_or_statement should be interpreted as a template name, or False (default) if name_or_statement should be interpreted as a template body.
  • cleanup (bool) – True if the rendered statement should be formatted, False (default) otherwise
  • meta_only (bool) – True if rendering should only progress as far as rendering nested templates (i.e. don’t actually substitute in variables from the context); False (default) otherwise.
Returns:

The rendered template.

Return type:

str

template_variables(name_or_statement, by_name=False)

Return the set of undeclared variables required for this template.

Parameters:
  • name_or_statement (str) – The name of a template (if by_name is True) or else a string representation of a jinja2 template.
  • by_name (bool) – True if name_or_statement should be interpreted as a template name, or False (default) if name_or_statement should be interpreted as a template body.
Returns:

A set of names which the template requires to be rendered.

Return type:

set<str>

username

Some services require authentication in order to connect to the service, in which case the appropriate username can be specified. If not specified at instantiation, your local login name will be used. If True was provided, you will be prompted to type your username at runtime as necessary. If False was provided, then None will be returned. You can specify a different username at runtime using: duct.username = ‘<username>’.

Type:str