Airflow BaseHook


class airflow.hooks.base_hook.BaseHook[source]

Bases: airflow.utils.log.logging_mixin.LoggingMixin

Abstract base class for hooks. Hooks are meant as an interface to interact with external systems.
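
A rough sketch (not taken from the text above) of how a custom hook typically builds on BaseHook: subclass it and call get_connection to resolve credentials from an Airflow connection. The hook name, connection id, and returned client shape below are hypothetical.

# Minimal sketch of a custom hook built on BaseHook; uses the 1.x import path
# shown elsewhere on this page. "my_service_default" is a made-up connection id.
from airflow.hooks.base_hook import BaseHook


class MyServiceHook(BaseHook):
    def __init__(self, my_conn_id="my_service_default"):
        self.my_conn_id = my_conn_id

    def get_conn(self):
        # get_connection reads host/login/password/extra from the Airflow connection
        conn = self.get_connection(self.my_conn_id)
        # Build whatever client the external system needs from those fields.
        return {"host": conn.host, "login": conn.login, "password": conn.password}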

Module Contents¶

  1. See the License for the specific language governing permissions and limitations under the License. From the Cloudant hook source: from past.builtins import unicode; import cloudant; from airflow.exceptions import AirflowException; from airflow.hooks.base_hook import BaseHook; from airflow.utils.log.logging_mixin import LoggingMixin.
  2. Source code for airflow.hooks.dbapi: Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership.
  3. From the Azure base hook source: from airflow.exceptions import AirflowException; from airflow.hooks.base import BaseHook; class AzureBaseHook(BaseHook): this hook acts as a base hook for Azure services. It offers several authentication mechanisms to authenticate the client library used for upstream Azure hooks. :param sdk_client: the SDKClient to use.
airflow.hooks.hive_hooks.HIVE_QUEUE_PRIORITIES = ['VERY_HIGH', 'HIGH', 'NORMAL', 'LOW', 'VERY_LOW'][source]
airflow.hooks.hive_hooks.get_context_from_env_var()[source]
Extract context from env variable, e.g. dag_id, task_id and execution_date,
so that they can be used inside BashOperator and PythonOperator.
Returns

The context of interest.

class airflow.hooks.hive_hooks.HiveCliHook(hive_cli_conn_id='hive_cli_default', run_as=None, mapred_queue=None, mapred_queue_priority=None, mapred_job_name=None)[source]

Bases: airflow.hooks.base_hook.BaseHook

Simple wrapper around the hive CLI.

It also supports beeline, a lighter CLI that runs JDBC and is replacing the heavier traditional CLI. To enable beeline, set the use_beeline param in the extra field of your connection as in {'use_beeline': true}.

Note that you can also set default hive CLI parameters using hive_cli_params in your connection's extra field, as in {'hive_cli_params': '-hiveconf mapred.job.tracker=some.jobtracker:444'}. Parameters passed here can be overridden by run_cli's hive_conf param.

The extra connection parameter auth gets passed as in the jdbc connection string as is.

Parameters
  • mapred_queue (str) – queue used by the Hadoop Scheduler (Capacity or Fair)

  • mapred_queue_priority (str) – priority within the job queue. Possible settings include: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW

  • mapred_job_name (str) – This name will appear in the jobtracker. This can make monitoring easier.
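
A minimal sketch, assuming the connection extras described above, of instantiating HiveCliHook; the queue and job names are placeholders, not values from this page.

# Hypothetical setup: the connection "hive_cli_default" carries extras such as
#   {"use_beeline": true,
#    "hive_cli_params": "-hiveconf mapred.job.tracker=some.jobtracker:444"}
from airflow.hooks.hive_hooks import HiveCliHook

hook = HiveCliHook(
    hive_cli_conn_id="hive_cli_default",
    mapred_queue="reporting",         # placeholder Hadoop scheduler queue
    mapred_queue_priority="HIGH",     # one of the HIVE_QUEUE_PRIORITIES values
    mapred_job_name="daily_aggregation",
)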

_get_proxy_user(self)[source]

This function sets the proper proxy_user value in case the user overwrites the default.

_prepare_cli_cmd(self)[source]

This function creates the command list from available information

static _prepare_hiveconf(d)[source]

This function prepares a list of hiveconf params from a dictionary of key value pairs.

Parameters

d (dict) –

run_cli(self, hql, schema=None, verbose=True, hive_conf=None)[source]

Run an hql statement using the hive cli. If hive_conf is specified it should be a dict and the entries will be set as key/value pairs in HiveConf.

Parameters

hive_conf (dict) – if specified these key value pairs will be passed to hive as -hiveconf 'key'='value'. Note that they will be passed after the hive_cli_params and thus will override whatever values are specified in the database.
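
A short sketch of run_cli with a hive_conf override; the query, schema, and setting are illustrative only.

from airflow.hooks.hive_hooks import HiveCliHook

hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")
hook.run_cli(
    hql="SELECT COUNT(*) FROM my_db.my_table;",           # made-up query
    schema="my_db",
    hive_conf={"mapreduce.job.queuename": "reporting"},   # passed as -hiveconf key=value
)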

test_hql(self, hql)[source]

Test an hql statement using the hive cli and EXPLAIN

load_df(self, df, table, field_dict=None, delimiter=',', encoding='utf8', pandas_kwargs=None, **kwargs)[source]

Loads a pandas DataFrame into hive.

Hive data types will be inferred if not passed but column names will not be sanitized.

Parameters
  • df (pandas.DataFrame) – DataFrame to load into a Hive table

  • table (str) – target Hive table, use dot notation to target a specific database

  • field_dict (collections.OrderedDict) – mapping from column name to hive data type. Note that it must be OrderedDict so as to keep columns' order.

  • delimiter (str) – field delimiter in the file

  • encoding (str) – str encoding to use when writing DataFrame to file

  • pandas_kwargs (dict) – passed to DataFrame.to_csv

  • kwargs – passed to self.load_file
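
A hedged sketch of load_df with a small in-memory DataFrame; the table name and column types are made up.

from collections import OrderedDict

import pandas as pd

from airflow.hooks.hive_hooks import HiveCliHook

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")
hook.load_df(
    df=df,
    table="my_db.users_staging",               # dot notation targets the database
    field_dict=OrderedDict([("id", "BIGINT"),  # OrderedDict keeps the column order
                            ("name", "STRING")]),
    delimiter=",",
    encoding="utf8",
)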

load_file(self, filepath, table, delimiter=',', field_dict=None, create=True, overwrite=True, partition=None, recreate=False, tblproperties=None)[source]

Loads a local file into Hive

Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.

Parameters
  • filepath (str) – local filepath of the file to load

  • table (str) – target Hive table, use dot notation to target a specific database

  • delimiter (str) – field delimiter in the file

  • field_dict (collections.OrderedDict) – A dictionary of the field names in the file as keys and their Hive types as values. Note that it must be OrderedDict so as to keep columns' order.

  • create (bool) – whether to create the table if it doesn't exist

  • overwrite (bool) – whether to overwrite the data in table or partition

  • partition (dict) – target partition as a dict of partition columns and values

  • recreate (bool) – whether to drop and recreate the table at every execution

  • tblproperties (dict) – TBLPROPERTIES of the hive table being created
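
A hedged sketch of load_file staging a local CSV into a partitioned table; the file path, table, and partition values are placeholders.

from collections import OrderedDict

from airflow.hooks.hive_hooks import HiveCliHook

hook = HiveCliHook(hive_cli_conn_id="hive_cli_default")
hook.load_file(
    filepath="/tmp/users.csv",                 # made-up local file
    table="my_db.users",
    delimiter=",",
    field_dict=OrderedDict([("id", "BIGINT"), ("name", "STRING")]),
    create=True,                               # create the table if it doesn't exist
    overwrite=True,                            # overwrite the target partition
    partition={"ds": "2021-01-01"},            # partition column -> value
    tblproperties={"comment": "staged by load_file"},
)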

kill(self)[source]
class airflow.hooks.hive_hooks.HiveMetastoreHook(metastore_conn_id='metastore_default')[source]

Bases: airflow.hooks.base_hook.BaseHook

Wrapper to interact with the Hive Metastore

MAX_PART_COUNT = 32767[source]
__getstate__(self)[source]
__setstate__(self, d)[source]
get_metastore_client(self)[source]

Returns a Hive thrift client.

get_conn(self)[source]
check_for_partition(self, schema, table, partition)[source]

Checks whether a partition exists

Parameters
  • schema (str) – Name of hive schema (database) @table belongs to

  • table – Name of hive table @partition belongs to

Partition

Expression that matches the partitions to check for (e.g. a = 'b' AND c = 'd')

Return type

bool

check_for_named_partition(self, schema, table, partition_name)[source]

Checks whether a partition with a given name exists

Parameters
  • schema (str) – Name of hive schema (database) @table belongs to

  • table – Name of hive table @partition belongs to

Partition

Name of the partitions to check for (e.g. a=b/c=d)

Return type

bool

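A brief sketch of the two partition checks above; the schema, table, and partition values are invented for illustration.

from airflow.hooks.hive_hooks import HiveMetastoreHook

hms = HiveMetastoreHook(metastore_conn_id="metastore_default")
# Filter-expression form vs. exact partition-name form; both return a bool.
hms.check_for_partition("my_db", "events", "ds='2021-01-01'")
hms.check_for_named_partition("my_db", "events", "ds=2021-01-01")
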
get_table(self, table_name, db='default')[source]

Get a metastore table object

get_tables(self, db, pattern='*')[source]

Get metastore table objects matching the pattern.

get_databases(self, pattern='*')[source]

Get metastore databases matching the pattern.

get_partitions(self, schema, table_name, filter=None)[source]

Returns a list of all partitions in a table. Works only for tables with less than 32767 partitions (java short max val). For subpartitioned tables, the number might easily exceed this.

static _get_max_partition_from_part_specs(part_specs, partition_key, filter_map)[source]

Helper method to get max partition of partitions with partition_key from part specs. key:value pairs in filter_map will be used to filter out partitions.

Parameters
  • part_specs (list) – list of partition specs.

  • partition_key (str) – partition key name.

  • filter_map (map) – partition_key:partition_value map used for partition filtering, e.g. {'key1': 'value1', 'key2': 'value2'}. Only partitions matching all partition_key:partition_value pairs will be considered as candidates of max partition.

Returns

Max partition or None if part_specs is empty.

max_partition(self, schema, table_name, field=None, filter_map=None)[source]

Returns the maximum value for all partitions with given field in a table. If only one partition key exists in the table, the key will be used as field. filter_map should be a partition_key:partition_value map and will be used to filter out partitions.

Parameters
  • schema (str) – schema name.

  • table_name (str) – table name.

  • field (str) – partition key to get max partition from.

  • filter_map (map) – partition_key:partition_value map used for partition filtering.
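
A hedged sketch of max_partition with a filter_map; the names and values are illustrative.

from airflow.hooks.hive_hooks import HiveMetastoreHook

hms = HiveMetastoreHook(metastore_conn_id="metastore_default")
latest_ds = hms.max_partition(
    schema="my_db",
    table_name="events",
    field="ds",                       # partition key to take the max over
    filter_map={"country": "us"},     # only consider partitions where country=us
)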

table_exists(self, table_name, db='default')[source]

Check if table exists

class airflow.hooks.hive_hooks.HiveServer2Hook(hiveserver2_conn_id='hiveserver2_default')[source]

Bases: airflow.hooks.base_hook.BaseHook

Wrapper around the pyhive library

Notes:

  • the default authMechanism is PLAIN, to override it you can specify it in the extra of your connection in the UI
  • the default for run_set_variable_statements is true, if you are using impala you may need to set it to false in the extra of your connection in the UI

get_conn(self, schema=None)[source]

Returns a Hive connection object.

_get_results(self, hql, schema='default', fetch_size=None, hive_conf=None)[source]
get_results(self, hql, schema='default', fetch_size=None, hive_conf=None)[source]

Get results of the provided hql in target schema.

Parameters
  • hql (str or list) – hql to be executed.

  • schema (str) – target schema, default to 'default'.

  • fetch_size (int) – max size of result to fetch.

  • hive_conf (dict) – hive_conf to execute along with the hql.

Returns

results of hql execution, dict with data (list of results) and header

Return type

dict

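A hedged sketch of get_results and of unpacking the returned dict; the query and schema are made up.

from airflow.hooks.hive_hooks import HiveServer2Hook

hs2 = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")
results = hs2.get_results(hql="SELECT id, name FROM users LIMIT 10", schema="my_db")
rows = results["data"]        # list of result rows
columns = results["header"]   # column metadata for the result set
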
to_csv(self, hql, csv_filepath, schema='default', delimiter=',', lineterminator='\r\n', output_header=True, fetch_size=1000, hive_conf=None)[source]

Execute hql in target schema and write results to a csv file.

Parameters
  • hql (str or list) – hql to be executed.

  • csv_filepath (str) – filepath of csv to write results into.

  • schema (str) – target schema, default to 'default'.

  • delimiter (str) – delimiter of the csv file, default to ','.

  • lineterminator (str) – lineterminator of the csv file.

  • output_header (bool) – header of the csv file, default to True.

  • fetch_size (int) – number of result rows to write into the csv file, default to 1000.

  • hive_conf (dict) – hive_conf to execute alone with the hql.
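
A hedged sketch of to_csv exporting query results to a local file; the path and query are placeholders.

from airflow.hooks.hive_hooks import HiveServer2Hook

hs2 = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")
hs2.to_csv(
    hql="SELECT id, name FROM users",
    csv_filepath="/tmp/users_export.csv",   # made-up output path
    schema="my_db",
    delimiter=",",
    lineterminator="\r\n",
    output_header=True,                     # write column names as the first row
    fetch_size=1000,                        # rows fetched/written per batch
)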

get_records(self, hql, schema='default')[source]

Get a set of records from a Hive query.

Parameters
  • hql (str or list) – hql to be executed.

  • schema (str) – target schema, default to 'default'.

  • hive_conf (dict) – hive_conf to execute along with the hql.

Returns

result of hive execution

Return type

list

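A hedged sketch of get_records; the query is illustrative.

from airflow.hooks.hive_hooks import HiveServer2Hook

hs2 = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")
records = hs2.get_records("SELECT COUNT(*) FROM users", schema="my_db")
row_count = records[0][0]   # records is a list of result rows
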
get_pandas_df(self, hql, schema='default')[source]

Get a pandas dataframe from a Hive query

Parameters
  • hql (str or list) – hql to be executed.

  • schema (str) – target schema, default to 'default'.

Returns

result of hql execution

Return type

pandas.DataFrame
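
A hedged sketch of get_pandas_df; the query is illustrative.

from airflow.hooks.hive_hooks import HiveServer2Hook

hs2 = HiveServer2Hook(hiveserver2_conn_id="hiveserver2_default")
df = hs2.get_pandas_df("SELECT id, name FROM users LIMIT 100", schema="my_db")
print(df.head())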

pylint-airflow

Latest version: 0.1.0a1

A Pylint plugin to lint Apache Airflow code.

Project description

Pylint plugin for static code analysis on Airflow code.

Usage

Installation:

pip install pylint-airflow

Usage:

pylint --load-plugins=pylint_airflow [your-dag-files]

This plugin runs on Python 3.6 and higher.

Error codes

The Pylint-Airflow codes follow the structure {I,C,R,W,E,F}83{0-9}{0-9}, where:

  • The characters show:
    • I = Info
    • C = Convention
    • R = Refactor
    • W = Warning
    • E = Error
    • F = Fatal
  • 83 is the base id (see all here https://github.com/PyCQA/pylint/blob/master/pylint/checkers/__init__.py)
  • {0-9}{0-9} is any number 00-99

The current codes are:

Code | Symbol | Description
C8300 | different-operator-varname-taskid | For consistency assign the same variable name and task_id to operators.
C8301 | match-callable-taskid | For consistency name the callable function '_[task_id]', e.g. PythonOperator(task_id='mytask', python_callable=_mytask).
C8302 | mixed-dependency-directions | For consistency don't mix directions in a single statement, instead split over multiple statements.
C8303 | task-no-dependencies | Sometimes a task without any dependency is desired, however often it is the result of a forgotten dependency.
C8304 | task-context-argname | Indicate you expect Airflow task context variables in the **kwargs argument by renaming to **context.
C8305 | task-context-separate-arg | To avoid unpacking kwargs from the Airflow task context in a function, you can set the needed variables as arguments in the function.
C8306 | match-dagid-filename | For consistency match the DAG filename with the dag_id.
R8300 | unused-xcom | Return values from a python_callable function or execute() method are automatically pushed as XCom.
W8300 | basehook-top-level | Airflow executes DAG scripts periodically and anything at the top level of a script is executed. Therefore, move BaseHook calls into functions/hooks/operators.
E8300 | duplicate-dag-name | DAG name should be unique.
E8301 | duplicate-task-name | Task name within a DAG should be unique.
E8302 | duplicate-dependency | Task dependencies can be defined only once.
E8303 | dag-with-cycles | A DAG is acyclic and cannot contain cycles.
E8304 | task-no-dag | A task must know a DAG instance to run.
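
As an illustration of the W8300 (basehook-top-level) check in the table above, the sketch below shows the pattern the plugin flags and the usual fix; the connection id and callable are made up, not taken from the plugin's documentation.

from airflow.hooks.base_hook import BaseHook

# Flagged by W8300 if uncommented: this would run on every scheduler parse of the DAG file.
# conn = BaseHook.get_connection("my_api_default")

def _fetch_data(**context):
    # Preferred: resolve the connection only when the task actually runs.
    conn = BaseHook.get_connection("my_api_default")
    return conn.host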

Documentation

Documentation is available on Read the Docs.

Contributing

Suggestions for more checks are always welcome; please create an issue on GitHub. Read CONTRIBUTING.rst for more details.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for pylint-airflow, version 0.1.0a1
Filename, size | File type | Python version
pylint_airflow-0.1.0a1-py3-none-any.whl (9.5 kB) | Wheel | py3
pylint-airflow-0.1.0a1.tar.gz (7.0 kB) | Source | None

Hashes for pylint_airflow-0.1.0a1-py3-none-any.whl

Algorithm | Hash digest
SHA256 | 530a43e902831f2806dc3f672c7eb46ac1378106c400a04dce2ea1ccd5c4d0c3
MD5 | c8fba07d31ad7d45b16f91212414c971
BLAKE2-256 | 5f795fff2161bfecfc249eb8050685e2403488d83290ac4e6237781176f3968e

Hashes for pylint-airflow-0.1.0a1.tar.gz

Algorithm | Hash digest
SHA256 | a0e3edb13932f35b24b24e66ebe596541600575b755b78a2a33d74d8bf94b620
MD5 | d8d8fa0b1ad4f553a1cfd2ab59141e51
BLAKE2-256 | ff55619e9e6b5710eee8439c725ab18e10ca6da47d1c6b90c7a43770debd1915



