UDF and Data Processing Operator

Pig Latin also provides three statements that make it possible to incorporate macros and user-defined functions into Pig scripts: REGISTER registers a JAR file with the Pig runtime; DEFINE creates an alias for a macro, UDF, streaming script, or command specification; and IMPORT imports macros defined in a separate file into a script.

The DEFINE statement is used to assign an alias to an external executable or a UDF. Use it when you want a short, readable name for a function that otherwise requires a lengthy package-qualified name.
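A minimal sketch of this pattern (the JAR path, macro file, and data file below are placeholders) showing REGISTER, DEFINE, and IMPORT together, with DEFINE giving a long piggybank function name a short alias:

-- define_alias.pig
register 'your_path_to_piggybank/piggybank.jar';
define Reverse org.apache.pig.piggybank.evaluation.string.Reverse();
import 'my_macros.pig';    -- pulls in any macros defined in my_macros.pig

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
    date:chararray, dividends:float);
backwards = foreach divs generate Reverse(symbol);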

For a STREAM command, DEFINE also plays an important role: its SHIP clause transfers the executable to the task nodes of the Hadoop cluster so it is available wherever the stream runs.
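A hedged sketch of this pattern, assuming a hypothetical script process.pl in the current directory and the divs relation loaded earlier:

-- ship the script to the task nodes and stream the data through it
define my_filter `process.pl` ship('process.pl');
filtered = stream divs through my_filter as (symbol:chararray, dividends:float);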

UDFs

Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in several languages, including Java, Python, JavaScript, and Ruby.

Registering UDFs

Registering Java UDFs:

-- register_java_udf.pig
register 'your_path_to_piggybank/piggybank.jar';

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
    date:chararray, dividends:float);
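Once the JAR is registered, its functions can be invoked by their fully qualified names; a sketch using the piggybank Reverse function on the relation above:

backwards = foreach divs generate org.apache.pig.piggybank.evaluation.string.Reverse(symbol);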

Registering Python UDFs (The Python script must be in your current directory):

-- register_python_udf.pig
register 'production.py' using jython as bballudfs;

players = load 'baseball' as (name:chararray, team:chararray,
    pos:bag{t:(p:chararray)}, bat:map[]);
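After registration, the script's functions are called through the namespace given in the as clause. A sketch assuming production.py defines a function named production and that the bat map contains the keys shown:

calcs = foreach players generate name,
    bballudfs.production(bat#'batting_average', bat#'slugging_percentage');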

Writing UDFs

Java UDFs:

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Return null for empty input rather than failing the task
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}
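A sketch of invoking this UDF from a Pig script, assuming the class has been compiled into a JAR named myudfs.jar and that relation A has a chararray field called name:

register 'myudfs.jar';
B = foreach A generate myudfs.UPPER(name);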

Python UDFs

# Square - square of a number of any data type.
# outputSchemaFunction names a script delegate function that determines the
# output schema for this UDF based on the input type.
@outputSchemaFunction("squareSchema")
def square(num):
    return num * num

# schemaFunction marks the delegate function; it is not itself registered with Pig.
@schemaFunction("squareSchema")
def squareSchema(input):
    return input

# Percent - percentage.
# outputSchema defines the schema for a script UDF in a format Pig can parse.
@outputSchema("percent:double")
def percent(num, total):
    return num * 100 / total
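A sketch of registering and calling these functions, assuming they live in a file named myfuncs.py and that relation A has numeric fields x and total:

register 'myfuncs.py' using jython as myfuncs;
B = foreach A generate myfuncs.square(x), myfuncs.percent(x, total);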

Pig’s Extensibility

There are three ways to incorporate external custom code in Pig scripts.

  • REGISTER: UDFs provide one avenue for including user code. To use a UDF written in Java, Python, JRuby, or Groovy, we use the REGISTER statement in the Pig script to register the container (a JAR file or Python script). When registering a Python UDF, you must also state which interpreter will run the script, typically Jython.
  • MAPREDUCE: This operator is used to embed MapReduce jobs in Pig scripts. We specify the MapReduce container JAR along with the inputs and outputs for the MapReduce program, as shown in the sketch after this list.
  • STREAM: This allows data to be sent to an external executable for processing as part of a Pig data processing pipeline. You can intermix relational operations, such as grouping and filtering, with custom or legacy executables. This is especially useful when the executable already contains all the custom code and you do not want to rewrite it in Pig. The external executable receives its input from standard input or a file, and writes its output either to standard output or a file.
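A hedged sketch of the MAPREDUCE operator, assuming a hypothetical wordcount.jar whose main class org.myorg.WordCount reads inputDir and writes outputDir:

-- store relation A as input for the MapReduce job, then load the job's output
B = mapreduce 'wordcount.jar'
        store A into 'inputDir'
        load 'outputDir' as (word:chararray, count:int)
        `org.myorg.WordCount inputDir outputDir`;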
