PDF Extractor
The PDF Extractor activity lets you extract data from a PDF file using an existing PDF template created in Automation Studio. To learn how to create a PDF template in Automation Studio, see the PDF Template Creator activity.
The dynamic PDF extraction capability allows you to automate actions on complex PDF documents. Advanced PDF controls and OCR capabilities extract information from PDF documents faster and with improved data quality and accuracy.
Using PDF Extractor Activity
- In the Canvas Tools pane, click File to expand the tool and view the associated activities.
- Drag the PDF Extractor activity and drop it onto the Flowchart designer on the Canvas.
- Click the PDF Location field and browse for the PDF file from which you want to extract data. The selected PDF file must be similar to the PDF template; otherwise, an error is returned.
- Instead of passing the PDF file to the activity as a default file, you can pass the PDF file as a parameter. Create a parameter in the Parameter pane, and assign the PDF file path (along with file name and file extension). In the Properties grid, enter the parameter name in the FileName field property.
- Click the Template Location field and browse for the PDF template created in Automation Studio. By default, the template is saved in the %localappdata% > EdgeVerve > AutomationStudio > ProtonFiles > PdfRepository folder.
The PDF Extractor activity with the default name is created.
You can perform a test run to view the extracted data. A message confirming successful data extraction is displayed in the Output console of Automation Studio.
The extracted data is saved in a .CSV file in the %localappdata% > EdgeVerve > AutomationStudio folder for further processing. You can delete the file, if required.
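For downstream processing outside Automation Studio, the saved .CSV file can be read with a short script. The following is a minimal sketch, not part of the product: the helper names (`default_output_folder`, `latest_csv`, `read_rows`) are hypothetical, and because the name of the generated file is not stated here, the sketch assumes the file of interest is the most recently modified .csv in the folder. The file encoding (UTF-8, with or without a byte order mark) is also an assumption.

```python
import csv
import os
from pathlib import Path
from typing import List, Optional


def default_output_folder() -> Path:
    """Build the default output folder named in this section.

    Falls back to the current directory when LOCALAPPDATA is not set
    (for example, on a non-Windows machine).
    """
    base = os.environ.get("LOCALAPPDATA", ".")
    return Path(base) / "EdgeVerve" / "AutomationStudio"


def latest_csv(folder: Path) -> Optional[Path]:
    """Return the most recently modified .csv file in folder, or None.

    Assumption: the newest .csv is the one written by the last run;
    the actual file name is not documented in this section.
    """
    if not folder.is_dir():
        return None
    candidates = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def read_rows(csv_path: Path) -> List[List[str]]:
    """Read every row of the CSV as a list of strings."""
    # utf-8-sig also reads plain UTF-8 and drops a leading BOM if present.
    with csv_path.open(newline="", encoding="utf-8-sig") as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    newest = latest_csv(default_output_folder())
    if newest is None:
        print("No .csv file found in the default output folder.")
    else:
        for row in read_rows(newest):
            print(row)
```

The rows are returned as plain lists of strings because the column layout depends on the fields defined in the PDF template.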
PDF Extractor Properties
The properties of the PDF Extractor activity are listed in the following table and can be edited in the Properties grid in the right pane.
| Property Name | Usage |
|---|---|
| Control Execution | |
| Ignore Error | When this option is set to Yes, the application ignores any error while executing the activity. If set to NA, it bypasses the exception (if any) to let the automation flow continue; however, it marks the automation status as failure in case of an exception. By default, this option is set to No. |
| Delay | |
| Wait After | Specify the time delay that must occur after the activity is executed. The value must be in milliseconds. |
| Wait Before | Specify the time delay that must occur before the activity is executed. The value must be in milliseconds. |
| Misc | |
| Breakpoint | Select this option to mark this activity as a pause point while debugging the process. At this point, the process freezes during execution, allowing you to examine whether the process is functioning as expected. In large or complex processes, breakpoints help in identifying the error, if any. |
| Compare Result | Compares the data extracted from the scanned PDF file with the original document in a comparison view in Automation Studio. By default, it is not selected. |
| Commented | Select this option to mark this activity as inactive in the entire process. When an activity is commented, it is ignored during the process execution. |
| DisplayName | The display name of the activity in the Flowchart designer area. By default, the name is set as PDF Extractor. You can change the name as required. |
| FileName | The path of the PDF file that you want to use for data extraction. You can enter a pre-defined parameter in this field to pass the PDF file as a parameter instead of the default file. |
| FolderPath | The location of the folder where the Excel file is created to save the extracted data. By default, the Excel file is created in the %localappdata% > EdgeVerve > AutomationStudio > ProtonFiles > PdfRepository folder. You can specify a folder location of your choice to override the default location. |
| PageRange | The range of pages that you want to retrieve. To specify a single page, enter the page number in double quotes, for example, "5". To specify a range of pages, enter the range in double quotes, for example, "2-5", or enter "All" for all the pages. Only string types are supported. By default, it is cleared. |
| TemplateName | The name of the configured PDF template used to extract data. Alternatively, you can select the required template in the Template Location field of the activity block. The template selected in the Properties grid is reflected in the activity block and vice versa. |
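The three PageRange forms described in the table ("5", "2-5", and "All") can be checked before they are entered in the Properties grid. The sketch below is illustrative only and is not the parser used by Automation Studio: `parse_page_range` and its `total_pages` argument are hypothetical, accepting "All" in any letter case is an assumption, and an empty value is rejected because the behavior of a cleared field is not described here.

```python
from typing import List


def parse_page_range(value: str, total_pages: int) -> List[int]:
    """Expand a PageRange-style string into 1-based page numbers.

    Accepted forms (taken from the property description):
      "5"    -> a single page
      "2-5"  -> an inclusive range of pages
      "All"  -> every page (letter case ignored here by assumption)
    """
    if not isinstance(value, str):
        # The property description states that only string types are supported.
        raise TypeError("PageRange must be a string")

    text = value.strip()
    if text.lower() == "all":
        return list(range(1, total_pages + 1))

    if "-" in text:
        start_text, end_text = text.split("-", 1)
        start, end = int(start_text), int(end_text)
    else:
        # int("") raises ValueError, so an empty value is rejected.
        start = end = int(text)

    if not (1 <= start <= end <= total_pages):
        raise ValueError(
            f"Page range {value!r} is outside 1-{total_pages} or is reversed"
        )
    return list(range(start, end + 1))


if __name__ == "__main__":
    for sample in ("5", "2-5", "All"):
        print(sample, parse_page_range(sample, total_pages=10))
```

Raising an error for a reversed or out-of-bounds range is a design choice of this sketch; it surfaces a mistyped value early instead of silently returning no pages.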