PDF Extractor

The PDF Extractor activity lets you extract data from a PDF file using an existing PDF template created in Automation Studio. To learn how to create a PDF template in Automation Studio, see the PDF Template Creator activity.

The dynamic PDF extraction capability allows you to automate actions on complex PDF documents. Advanced PDF controls and OCR capabilities extract information from PDF documents faster and with improved data quality and accuracy.

Using PDF Extractor Activity

  1. In the Canvas Tools pane, click File to expand the tool and view the associated activities.
  2. Drag the PDF Extractor activity and drop it onto the Flowchart designer on the Canvas.
  3. Click the PDF Location field and browse to the PDF file from which you want to extract data. The selected PDF file must be similar to the PDF template; otherwise, an error occurs.
    • Instead of selecting a fixed PDF file in the activity, you can pass the PDF file as a parameter. Create a parameter in the Parameter pane and assign it the PDF file path (including the file name and file extension). In the Properties grid, enter the parameter name in the FileName property.
  4. Click the Template Location field and browse to the PDF template created in Automation Studio. By default, the template is saved in the %localappdata% > EdgeVerve > AutomationStudio > ProtonFiles > PdfRepository folder.
    The PDF Extractor activity is created with the default name.
    You can perform a test run to view the extracted data. A message confirming successful data extraction is displayed in the Output console of Automation Studio.



    The extracted data is saved in a .CSV file in the %localappdata% > EdgeVerve > AutomationStudio folder for further processing. You can delete the file if required.
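If the extracted data needs further processing outside Automation Studio, the .CSV file can be read with any CSV library. The following is a minimal Python sketch, not part of the product. It assumes that the first row of the file contains column headers and that the file is UTF-8 encoded. The file name extracted_data.csv and the helper names are hypothetical, because the actual output file name is not stated here; check the folder for the file that the activity produces.

```python
import csv
import os
from pathlib import Path


def default_output_folder() -> Path:
    """Resolve %localappdata% > EdgeVerve > AutomationStudio."""
    # LOCALAPPDATA is defined on Windows; fall back to the current
    # directory on other systems so the sketch still runs.
    base = os.environ.get("LOCALAPPDATA", ".")
    return Path(base) / "EdgeVerve" / "AutomationStudio"


def read_extracted_rows(csv_path):
    """Return the CSV rows as a list of dicts keyed by the header row."""
    # Assumption: the first row holds the column headers.
    # utf-8-sig tolerates a byte-order mark if one is present.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical file name; replace it with the file the activity created.
    candidate = default_output_folder() / "extracted_data.csv"
    if candidate.exists():
        for row in read_extracted_rows(candidate):
            print(row)
    else:
        print(f"No file found at {candidate}")
```

The column names in each row depend on the fields defined in the PDF template, so downstream code should look them up by header rather than by position.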

PDF Extractor Properties

The properties of the PDF Extractor activity are listed in the following table and can be edited in the Properties grid in the right pane.

 

| Category | Property Name | Usage |
|---|---|---|
| Control Execution | Ignore Error | When this option is set to Yes, the application ignores any error while executing the activity. When set to NA, it bypasses the exception (if any) so that the automation flow continues, but marks the automation status as failure. By default, this option is set to No. |
| Delay | Wait After | Specify the time delay that must occur after the activity is executed. The value must be in milliseconds. |
| Delay | Wait Before | Specify the time delay that must occur before the activity is executed. The value must be in milliseconds. |
| Misc | Breakpoint | Select this option to mark this activity as a pause point while debugging the process. Execution pauses at this point so that you can check whether the process is functioning as expected. In large or complex processes, breakpoints help identify errors, if any. |
| Misc | Compare Result | Compares the data extracted from the scanned PDF file with the original document in a comparison view in Automation Studio. By default, it is not selected. |
| Misc | Commented | Select this option to mark this activity as inactive in the entire process. When an activity is commented, it is ignored during process execution. |
| Misc | DisplayName | The display name of the activity in the Flowchart designer area. By default, the name is PDF Extractor. You can change the name as required. |
| Misc | FileName | The path of the PDF file that you want to use for data extraction. You can enter a predefined parameter in this field to pass the PDF file as a parameter instead of using a fixed file. |
| Misc | FolderPath | The folder in which the Excel file that stores the extracted data is created. By default, the Excel file is created in the %localappdata% > EdgeVerve > AutomationStudio > ProtonFiles > PdfRepository folder. You can specify a folder of your choice to override the default location. |
| Misc | PageRange | The pages that you want to retrieve. To specify a single page, enter the page number in double quotes, for example, "5". To specify a range of pages, enter the range in double quotes, for example, "2-5", or enter "All" for all pages. Only string values are supported. By default, this property is empty. |
| Misc | TemplateName | The name of the configured PDF template used to extract data. Alternatively, you can select the required template in the Template Location field of the activity block. The template selected in the Properties grid is reflected in the activity block and vice versa. |
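To illustrate the PageRange formats described above ("5", "2-5", and "All"), the following Python sketch expands such a string into page numbers. It is an illustration of the format only, not the parser used by Automation Studio. The function name parse_page_range, the total_pages argument, the case-insensitive handling of "All", and the bounds check are all assumptions made for this example.

```python
def parse_page_range(value, total_pages):
    """Expand a PageRange-style string into a list of 1-based page numbers."""
    # The property accepts only strings, so reject other types early.
    if not isinstance(value, str):
        raise TypeError('PageRange must be a string, for example "5" or "2-5"')

    text = value.strip()

    # "All" selects every page (matched case-insensitively as an assumption).
    if text.lower() == "all":
        return list(range(1, total_pages + 1))

    # "2-5" is an inclusive range; "5" is a single page.
    if "-" in text:
        start_text, end_text = text.split("-", 1)
        start, end = int(start_text), int(end_text)
    else:
        start = end = int(text)

    # Assumed validation: pages must lie within the document.
    if not 1 <= start <= end <= total_pages:
        raise ValueError(f"Page range {value!r} is outside 1-{total_pages}")

    return list(range(start, end + 1))
```

For a 10-page document, parse_page_range("2-5", 10) yields pages 2 through 5, and parse_page_range("All", 10) yields all ten pages. Passing the integer 5 instead of the string "5" raises a TypeError, which mirrors the rule that only string values are supported.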