Databases

All database clients are expected to be subclasses of DatabaseClient, and so will share a common API and inherit a suite of IPython magics. Protocol implementations are also free to add extra methods, which are documented in the “Subclass Reference” section below.

Common API

class omniduct.databases.base.DatabaseClient(session_properties=None, templates=None, template_context=None, default_format_opts=None, **kwargs)[source]

Bases: omniduct.duct.Duct, omniduct.utils.magics.MagicsProvider

An abstract class providing the common API for all database clients.

Note: DatabaseClient subclasses are callable, so that one can use DatabaseClient(…) as a short-hand for DatabaseClient.query(…).

Class Attributes:
 
  • DUCT_TYPE (Duct.Type) – The type of Duct protocol implemented by this class.
  • DEFAULT_PORT (int) – The default port for the database service (defined by subclasses).
  • CURSOR_FORMATTERS (dict<str, CursorFormatter) – asdsd
  • DEFAULT_CURSOR_FORMATTER (str) – …
Attributes inherited from Duct:
protocol (str): The name of the protocol for which this instance was
created (especially useful if a Duct subclass supports multiple protocols).
name (str): The name given to this Duct instance (defaults to class
name).
host (str): The host name providing the service (will be ‘127.0.0.1’, if
service is port forwarded from remote; use ._host to see remote host).
port (int): The port number of the service (will be the port-forwarded
local port, if relevant; for remote port use ._port).

username (str, bool): The username to use for the service. password (str, bool): The password to use for the service. registry (None, omniduct.registry.DuctRegistry): A reference to a

DuctRegistry instance for runtime lookup of other services.
remote (None, omniduct.remotes.base.RemoteClient): A reference to a
RemoteClient instance to manage connections to remote services.
cache (None, omniduct.caches.base.Cache): A reference to a Cache
instance to add support for caching, if applicable.
connection_fields (tuple<str>, list<str>): A list of instance attributes
to monitor for changes, whereupon the Duct instance should automatically disconnect. By default, the following attributes are monitored: ‘host’, ‘port’, ‘remote’, ‘username’, and ‘password’.
prepared_fields (tuple<str>, list<str>): A list of instance attributes to
be populated (if their values are callable) when the instance first connects to a service. Refer to Duct.prepare and Duct._prepare for more details. By default, the following attributes are prepared: ‘_host’, ‘_port’, ‘_username’, and ‘_password’.

Additional attributes including host, port, username and password are documented inline.

Class Attributes:
AUTO_LOGGING_SCOPE (bool): Whether this class should be used by omniduct
logging code as a “scope”. Should be overridden by subclasses as appropriate.
DUCT_TYPE (Duct.Type): The type of Duct service that is provided by
this Duct instance. Should be overridden by subclasses as appropriate.
PROTOCOLS (list<str>): The name(s) of any protocols that should be
associated with this class. Should be overridden by subclasses as appropriate.
NAMESPACE_DEFAULTS_READ

Backwards compatible shim for NAMESPACE_DEFAULTS.

NAMESPACE_DEFAULTS_WRITE

Unless overridden, this is the same as NAMESPACE_DEFAULTS_READ.

__init__(session_properties=None, templates=None, template_context=None, default_format_opts=None, **kwargs)[source]
session_properties (dict): A mapping of default session properties
to values. Interpretation is left up to implementations.
templates (dict): A dictionary of name to template mappings. Additional
templates can be added using .template_add.
template_context (dict): The default template context to use when
rendering templates.
default_format_opts (dict): The default formatting options passed to
cursor formatter.
session_properties

The default session properties used in statement executions.

Type:dict
classmethod statement_hash(statement, cleanup=True)[source]

Retrieve the hash to use to identify query statements to the cache.

Parameters:
  • statement (str) – A string representation of the statement to be hashed.
  • cleanup (bool) – Whether the statement should first be consistently reformatted using statement_cleanup.
Returns:

The hash used to identify a statement to the cache.

Return type:

str

classmethod statement_cleanup(statement)[source]

Clean up statements prior to hash computation.

This classmethod takes an SQL statement and reformats it by consistently removing comments and replacing all whitespace. It is used by the statement_hash method to avoid functionally identical queries hitting different cache keys. If the statement’s language is not to be SQL, this method should be overloaded appropriately.

Parameters:statement (str) – The statement to be reformatted/cleaned-up.
Returns:The new statement, consistently reformatted.
Return type:str
execute(statement, wait=True, cursor=None, session_properties=None, **kwargs)[source]

Execute a statement against this database and return a cursor object.

Where supported by database implementations, this cursor can the be used in future executions, by passing it as the cursor keyword argument.

Parameters:
  • statement (str) – The statement to be executed by the query client (possibly templated).
  • wait (bool) – Whether the cursor should be returned before the server-side query computation is complete and the relevant results downloaded.
  • cursor (DBAPI2 cursor) – Rather than creating a new cursor, execute the statement against the provided cursor.
  • session_properties (dict) – Additional session properties and/or overrides to use for this query. Setting a session property value to None will cause it to be omitted.
  • **kwargs (dict) – Extra keyword arguments to be passed on to _execute, as implemented by subclasses.
  • template (bool) – Whether the statement should be treated as a Jinja2 template. [Used by render_statement decorator.]
  • context (dict) – The context in which the template should be evaluated (a dictionary of parameters to values). [Used by render_statement decorator.]
  • use_cache (bool) – True or False (default). Whether to use the cache (if present). [Used by cached_method decorator.]
  • renew (bool) – True or False (default). If cache is being used, renew it before returning stored value. [Used by cached_method decorator.]
  • cleanup (bool) – Whether statement should be cleaned up before computing the hash used to cache results. [Used by cached_method decorator.]
Returns:

A DBAPI2 compatible cursor instance.

Return type:

DBAPI2 cursor

query(statement, format=None, format_opts={}, use_cache=True, **kwargs)[source]

Execute a statement against this database and collect formatted data.

Parameters:
  • statement (str) – The statement to be executed by the query client (possibly templated).
  • format (str) – A subclass of CursorFormatter, or one of: ‘pandas’, ‘hive’, ‘csv’, ‘tuple’ or ‘dict’. Defaults to self.DEFAULT_CURSOR_FORMATTER.
  • format_opts (dict) – A dictionary of format-specific options.
  • use_cache (bool) – Whether to cache the cursor returned by DatabaseClient.execute() (overrides the default of False for .execute()). (default=True)
  • **kwargs (dict) – Additional arguments to pass on to DatabaseClient.execute().
Returns:

The results of the query formatted as nominated.

stream(statement, format=None, format_opts={}, batch=None, **kwargs)[source]

Execute a statement against this database and stream formatted results.

This method returns a generator over objects representing rows in the result set. If batch is not None, then the iterator will be over lists of length batch containing formatted rows.

Parameters:
  • statement (str) – The statement to be executed against the database.
  • format (str) – A subclass of CursorFormatter, or one of: ‘pandas’, ‘hive’, ‘csv’, ‘tuple’ or ‘dict’. Defaults to self.DEFAULT_CURSOR_FORMATTER.
  • format_opts (dict) – A dictionary of format-specific options.
  • batch (int) – If not None, the number of rows from the resulting cursor to be returned at once.
  • **kwargs (dict) – Additional keyword arguments to pass onto DatabaseClient.execute.
Returns:

An iterator over objects of the nominated format or, if

batched, a list of such objects.

Return type:

iterator

stream_to_file(statement, file, format='csv', fs=None, **kwargs)[source]

Execute a statement against this database and stream results to a file.

This method is a wrapper around DatabaseClient.stream that enables the iterative writing of cursor results to a file. This is especially useful when there are a very large number of results, and loading them all into memory would require considerable resources. Note that ‘csv’ is the default format for this method (rather than pandas).

Parameters:
  • statement (str) – The statement to be executed against the database.
  • file (str, file-like-object) – The filename where the data should be written, or an open file-like resource.
  • format (str) – The format to be used (‘csv’ by default). Format options can be passed via **kwargs.
  • fs (None, FileSystemClient) – The filesystem wihin which the nominated file should be found. If None, the local filesystem will be used.
  • **kwargs – Additional keyword arguments to pass onto DatabaseClient.stream.
execute_from_file(file, fs=None, **kwargs)[source]

Execute a statement stored in a file.

Parameters:
  • file (str, file-like-object) – The path of the file containing the query statement to be executed against the database, or an open file-like resource.
  • fs (None, FileSystemClient) – The filesystem wihin which the nominated file should be found. If None, the local filesystem will be used.
  • **kwargs (dict) – Extra keyword arguments to pass on to DatabaseClient.execute.
Returns:

A DBAPI2 compatible cursor instance.

Return type:

DBAPI2 cursor

query_from_file(file, fs=None, **kwargs)[source]

Query using a statement stored in a file.

Parameters:
  • file (str, file-like-object) – The path of the file containing the query statement to be executed against the database, or an open file-like resource.
  • fs (None, FileSystemClient) – The filesystem wihin which the nominated file should be found. If None, the local filesystem will be used.
  • **kwargs (dict) – Extra keyword arguments to pass on to DatabaseClient.query.
Returns:

The results of the query formatted as nominated.

Return type:

object

template_names

A list of names associated with the templates associated with this client.

Type:list
template_add(name, body)[source]

Add a named template to the internal dictionary of templates.

Note: Templates are interpreted as jinja2 templates. See .template_render for more information.

Parameters:
  • name (str) – The name of the template.
  • body (str) – The (typically) multiline body of the template.
Returns:

A reference to this object.

Return type:

PrestoClient

template_get(name)[source]

Retrieve a named template.

Parameters:name (str) – The name of the template to retrieve.
Raises:ValueError – If name is not associated with a template.
Returns:The requested template.
Return type:str
template_variables(name_or_statement, by_name=False)[source]

Return the set of undeclared variables required for this template.

Parameters:
  • name_or_statement (str) – The name of a template (if by_name is True) or else a string representation of a jinja2 template.
  • by_name (bool) – True if name_or_statement should be interpreted as a template name, or False (default) if name_or_statement should be interpreted as a template body.
Returns:

A set of names which the template requires to be rendered.

Return type:

set<str>

template_render(name_or_statement, context=None, by_name=False, cleanup=False, meta_only=False)[source]

Render a template by name or value.

In addition to the jinja2 templating syntax, described in more detail in the official jinja2 documentation, a meta-templating extension is also provided. This meta-templating allows you to reference other reference other templates. For example, if you had two SQL templates named ‘template_a’ and ‘template_b’, then you could render them into one SQL query using (for example):

.template_render('''
WITH
    a AS (
        {{{template_a}}}
    ),
    b AS (
        {{{template_b}}}
    )
SELECT *
FROM a
JOIN b ON a.x = b.x
''')

Note that template substitution in this way is iterative, so you can chain template embedding, provided that such embedding is not recursive.

Parameters:
  • name_or_statement (str) – The name of a template (if by_name is True) or else a string representation of a jinja2 template.
  • context (dict, None) – A dictionary to use as the template context. If not specified, an empty dictionary is used.
  • by_name (bool) – True if name_or_statement should be interpreted as a template name, or False (default) if name_or_statement should be interpreted as a template body.
  • cleanup (bool) – True if the rendered statement should be formatted, False (default) otherwise
  • meta_only (bool) – True if rendering should only progress as far as rendering nested templates (i.e. don’t actually substitute in variables from the context); False (default) otherwise.
Returns:

The rendered template.

Return type:

str

execute_from_template(name, context=None, **kwargs)[source]

Render and then execute a named template.

Parameters:
  • name (str) – The name of the template to be rendered and executed.
  • context (dict) – The context in which the template should be rendered.
  • **kwargs (dict) – Additional parameters to pass to .execute().
Returns:

A DBAPI2 compatible cursor instance.

Return type:

DBAPI2 cursor

query_from_template(name, context=None, **kwargs)[source]

Render and then query using a named tempalte.

Parameters:
  • name (str) – The name of the template to be rendered and used to query the database.
  • context (dict) – The context in which the template should be rendered.
  • **kwargs (dict) – Additional parameters to pass to .query().
Returns:

The results of the query formatted as nominated.

Return type:

object

query_to_table(statement, table, if_exists='fail', **kwargs)[source]

Run a query and store the results in a table in this database.

Parameters:
  • statement – The statement to be executed.
  • table (str) – The name of the table into which the dataframe should be uploaded.
  • if_exists (str) – if nominated table already exists: ‘fail’ to do nothing, ‘replace’ to drop, recreate and insert data into new table, and ‘append’ to add data from this table into the existing table.
  • **kwargs (dict) – Additional keyword arguments to pass onto DatabaseClient._query_to_table.
Returns:

The cursor object associated with the execution.

Return type:

DB-API cursor

dataframe_to_table(df, table, if_exists='fail', **kwargs)[source]

Upload a local pandas dataframe into a table in this database.

Parameters:
  • df (pandas.DataFrame) – The dataframe to upload into the database.
  • table (str, ParsedNamespaces) – The name of the table into which the dataframe should be uploaded.
  • if_exists (str) – if nominated table already exists: ‘fail’ to do nothing, ‘replace’ to drop, recreate and insert data into new table, and ‘append’ to add data from this table into the existing table.
  • **kwargs (dict) – Additional keyword arguments to pass onto DatabaseClient._dataframe_to_table.
table_list(namespace=None, renew=True, **kwargs)[source]

Return a list of table names in the data source as a DataFrame.

Parameters:
  • namespace (str) – The namespace in which to look for tables.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

The names of schemas in this database.

Return type:

list<str>

table_exists(table, renew=True, **kwargs)[source]

Check whether a table exists.

Parameters:
  • table (str) – The table for which to check.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

True if table exists, and False otherwise.

Return type:

bool

table_drop(table, **kwargs)[source]

Remove a table from the database.

Parameters:
  • table (str) – The table to drop.
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

The cursor associated with this execution.

Return type:

DB-API cursor

table_desc(table, renew=True, **kwargs)[source]

Describe a table in the database.

Parameters:
  • table (str) – The table to describe.
  • renew (bool) – Whether to renew the results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A dataframe description of the table.

Return type:

pandas.DataFrame

table_partition_cols(table, renew=True, **kwargs)[source]

Extract the columns by which a table is partitioned (if database supports partitions).

Parameters:
  • table (str) – The table from which to extract data.
  • renew (bool) – Whether to renew the results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A list of columns by which table is partitioned.

Return type:

list<str>

table_head(table, n=10, renew=True, **kwargs)[source]

Retrieve the first n rows from a table.

Parameters:
  • table (str) – The table from which to extract data.
  • n (int) – The number of rows to extract.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A dataframe representation of the first n rows

of the nominated table.

Return type:

pandas.DataFrame

table_props(table, renew=True, **kwargs)[source]

Retrieve the properties associated with a table.

Parameters:
  • table (str) – The table from which to extract data.
  • renew (bool) – Whether to renew the table list or use cached results (default: True).
  • **kwargs (dict) – Additional arguments passed through to implementation.
Returns:

A dataframe representation of the table

properties.

Return type:

pandas.DataFrame

connect()

Connect to the service backing this client.

It is not normally necessary for a user to manually call this function, since when a connection is required, it is automatically created.

Returns:A reference to the current object.
Return type:Duct instance
disconnect()

Disconnect this client from backing service.

This method is automatically called during reconnections and/or at Python interpreter shutdown. It first calls Duct._disconnect (which should be implemented by subclasses) and then notifies the RemoteClient subclass, if present, to stop port-forwarding the remote service.

Returns:A reference to this object.
Return type:Duct instance
is_connected()

Check whether this Duct instances is currently connected.

This method checks to see whether a Duct instance is currently connected. This is performed by verifying that the remote host and port are still accessible, and then by calling Duct._is_connected, which should be implemented by subclasses.

Returns:Whether this Duct instance is currently connected.
Return type:bool
prepare()

Prepare a Duct subclass for use (if not already prepared).

This method is called before the value of any of the fields referenced in self.connection_fields are retrieved. The fields include, by default: ‘host’, ‘port’, ‘remote’, ‘cache’, ‘username’, and ‘password’. Subclasses may add or subtract from these special fields.

When called, it first checks whether the instance has already been prepared, and if not calls _prepare and then records that the instance has been successfully prepared.

DatabaseClient Quirks:

This method may be overridden by subclasses, but provides the following default behaviour:

  • Ensures self.registry, self.remote and self.cache values are instances of the right types.
  • It replaces string values of self.remote and self.cache with remotes and caches looked up using self.registry.lookup.
  • It looks through each of the fields nominated in self.prepared_fields and, if the corresponding value is callable, sets the value of that field to result of calling that value with a reference to self. By default, prepared_fields contains ‘_host’, ‘_port’, ‘_username’, and ‘_password’.
  • Ensures value of self.port is an integer (or None).
register_magics(base_name=None)

The following magic functions will be registered (assuming that the base name is chosen to be ‘hive’): - Cell Magics:

  • %%hive: For querying the database.
  • %%hive.execute: For executing a statement against the database.
  • %%hive.stream: For executing a statement against the database,
    and streaming the results.
  • %%hive.template: The defining a new template.
  • %%hive.render: Render a provided query statement.
  • Line Magics:
    • %hive: For querying the database using a named template.
    • %hive.execute: For executing a named template statement against
      the database.
    • %hive.stream: For executing a named template against the database,
      and streaming the results.
    • %hive.render: Render a provided a named template.
    • %hive.desc: Describe the table nominated.
    • %hive.head: Return the first rows in a specified table.
    • %hive.props: Show the properties specified for a nominated table.

Documentation for these magics is provided online.

IPython Magics

While it is possible in an IPython/Jupter notebook session to write code along the lines of:

results = db_client.query("""
SELECT *
FROM table
WHERE condition = 1
""", format='pandas', ...)

manually encapsulating queries in quotes quickly becomes tiresome and cluttered. We therefore expose most functionality as IPython magic functions. For example, the above code could instad be rendered (assuming magic functions have been registered under the name db_client):

%%db_client results format='pandas' ...
SELECT *
FROM table
WHERE condition = 1

Especially when combined with templating, this can greatly improve the readability of your code.

In the following, all of the provided magic functions are listed along with the equivalent programmatic code. Note that all arguments are passed in as space-separated tokens after the magic’s name. Position-arguments are always interpreted as strings and keyword arguments are expected to be provided in the form ‘<key>=<value>’, where the <value> will be run as Python code and the resulting value passed on to the underlying function/method as:

db_client.method(..., key=eval('<value>'), ...)

Where present in the following, arguments in square brackets after the magic name are the options specific to the magic function, and an ellipsis (’…’) indicates that any additional keyword arguments will be passed on to the appropriate method.

Querying

%%<name> [variable=None show='head' transpose=False ...]
SELECT *
FROM table
WHERE condition = 1

This magic is equivalent to calling db_client.query("<sql>", ...), with the following magic-specific parameters offering additional flexibility:

  • variable (str):
    The name of the local variable where the output should be stored (typically not referenced directly by name)
  • show (str, int, None):
    What should be shown if variable is specified (if not the entire output is returned). Allowed values are ‘all’, ‘head’ (first 5 rows), ‘none’, or an integer which specifies the number of rows to be shown.
  • transpose (bool):
    If format is pandas, whether the shown results, as defined above, should be transposed. Data stored into variable is never transposed.

There is also a line-magic version if you are querying using an existing template:

results = %<name> variable='<template_name>' ...

which is equivalent to db_client.query_from_template('<template_name>', context=locals()). Note that one would typically pass this the template name as a position argument, i.e. %<name> <template_name>.

Executing

%%<name>.execute [variable=None ...]
INSERT INTO database.table (field1, field2) VALUES (1, 2);

This magic is equivalent to db_client.execute('<sql>', ...), with the variable argument functioning as previously for the query magic.

As for the query magic, there is also a template version:

Streaming

%%<name>.stream [variable=None ...]
SELECT *
FROM table
WHERE condition = 1

This magic is equivalent to db_client.stream('<sql>', ...), with the variable argument functioning as previously for the query magic. Keep in mind that the value returned from this method is a generator object.

As for the query magic, there is also a template version:

Templating

To create a new template:

%%<name>.template <template_name>
SELECT *
FROM table
WHERE condition = 1

which is equivalent to db_client.add_template("<template_name>", "<sql>").

You can render a template in the cell body using current context (or specified context):

%%<name>.render [context=None, show=True]
SELECT 1 FROM test

or if the template has already been created, you can render it directly by name:

%<name>.render [name=None, context=None, show=True]

In both cases, the context and show parameters respectively control the context from which template variables are extracted and whether the rendered template should be shown (printed to screen) or returned as a string.

Table properties

todo:Resolve what to keep and dump here.
%%<name>.desc
SELECT 1 FROM test
%%<name>.head
SELECT 1 FROM test
%%<name>.props

Subclass Reference

For comprehensive documentation on any particular subclass, please refer to one of the below documents.