How to fix ParquetJavaInvocationException when writing a Parquet file to ADLS using ADF

Rutger Verhaar | Adaptfy
2023-10-09T13:07:05.8166667+00:00

Hi,

I want to copy tables from an on-premises SQL Server to my Azure Data Lake and write the files in Parquet format. I have installed the JRE on the machine that hosts the self-hosted integration runtime (IR).
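
A minimal check like the following can confirm which JRE the IR machine actually has and what its version string looks like (the class name is just my own):

```java
// Minimal environment check, run on the self-hosted IR machine, to see
// which JRE is installed and what its version string looks like.
public class JreCheck {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}
```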

When I create a dataset with Parquet as the format and set the schema to None, I am able to write files in Parquet. However, when I try to view the contents of the Parquet file with a parquet viewer, the file comes back as corrupted. I have also tried using the Parquet file as a source and converting it to CSV in an ADF sink, but that ran into failures too.
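
As an independent check outside ADF, a small program like the following sketch could read the file's footer directly with the parquet-hadoop library (the local path is a placeholder, and parquet-hadoop plus its Hadoop dependencies need to be on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class FooterCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder path to a locally downloaded copy of the file.
        Path file = new Path("C:/tmp/sample.parquet");
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            // A valid Parquet file yields its schema and row count here;
            // a truncated or corrupted file fails while reading the footer.
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println(schema);
            System.out.println("rows = " + reader.getRecordCount());
        }
    }
}
```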

When I try to create a new dataset and import the schema from the connection/store, I get the following error message:

```
An error occurred when invoking java, message: java.io.IOException:Error reading summaries
total entry:6
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:190)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)

java.util.concurrent.ExecutionException:java.lang.ExceptionInInitializerError
total entry:9
java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
org.apache.parquet.hadoop.ParquetFileReader.runAllInParallel(ParquetFileReader.java:227)
org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:185)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:112)
org.apache.parquet.hadoop.ParquetReader.<init>(ParquetReader.java:45)
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:202)
com.microsoft.datatransfer.bridge.parquet.ParquetBatchReaderBridge.open(ParquetBatchReaderBridge.java:62)
com.microsoft.datatransfer.bridge.parquet.ParquetFileBridge.createReader(ParquetFileBridge.java:22)

java.lang.ExceptionInInitializerError:null
total entry:24
org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
org.apache.hadoop.security.Groups.<init>(Groups.java:86)
org.apache.hadoop.security.Groups.<init>(Groups.java:66)
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2753)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2745)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2611)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.parquet.hadoop.ParquetFileReader.readSummaryMetadata(ParquetFileReader.java:360)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:158)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:155)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
java.base/java.lang.Thread.run(Thread.java:1583)

java.lang.StringIndexOutOfBoundsException:Range [0, 3) out of bounds for length 2
total entry:34
java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
java.base/jdk.internal.util.Preconditions.outOfBoundsCheckFromToIndex(Preconditions.java:112)
java.base/jdk.internal.util.Preconditions.checkFromToIndex(Preconditions.java:349)
java.base/java.lang.String.checkBoundsBeginEnd(String.java:4861)
java.base/java.lang.String.substring(String.java:2830)
org.apache.hadoop.util.Shell.<clinit>(Shell.java:49)
org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
org.apache.hadoop.security.Groups.<init>(Groups.java:86)
org.apache.hadoop.security.Groups.<init>(Groups.java:66)
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2753)
org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2745)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2611)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.parquet.hadoop.ParquetFileReader.readSummaryMetadata(ParquetFileReader.java:360)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:158)
org.apache.parquet.hadoop.ParquetFileReader$1.call(ParquetFileReader.java:155)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
java.base/java.lang.Thread.run(Thread.java:1583)
```
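
Reading the deepest stack above, the class initializer of org.apache.hadoop.util.Shell appears to take a fixed three-character substring of the JRE version string, and "Range [0, 3) out of bounds for length 2" suggests the installed JRE reports a two-character java.version such as "21". A minimal sketch (my own class) of what the trace implies is happening:

```java
// My own repro of what the trace suggests Hadoop's Shell.<clinit> does:
// take the first three characters of the JRE version string.
public class VersionSubstringRepro {
    public static void main(String[] args) {
        String v = System.getProperty("java.version"); // e.g. "1.8.0_381" or "21"
        // On a legacy JRE this yields "1.8"; on a JRE that reports just "21"
        // it throws StringIndexOutOfBoundsException: Range [0, 3) out of
        // bounds for length 2, matching the error above.
        System.out.println(v.substring(0, 3));
    }
}
```

If that reading is right, the Hadoop libraries bundled with the IR's Parquet bridge may simply not cope with a modern JRE's short version string, and a JRE 8 (1.8.x) installation, whose java.version starts with "1.8", might behave differently.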

Does anyone know what is missing or what I can update to be able to write Parquet files properly?
