
URL: https://issues.apache.org/jira/browse/SPARK-23649

[SPARK-23649] CSV schema inferring fails on some UTF-8 chars



Description

Schema inference for CSV files fails if the file contains a character whose first byte is 0xFF.

spark.read.option("header", "true").csv("utf8xFF.csv")
java.lang.ArrayIndexOutOfBoundsException: 63
 at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
 at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
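
The stack trace points at a lookup keyed on the first byte of a character, and the failing index is 63. One plausible reading, not confirmed by this report, is that the byte is mapped to a table index as `(b & 0xFF) - 192`, which gives exactly 63 for 0xFF. The sketch below illustrates that arithmetic only. The object name, the 62-entry table, and the bounds-checked variant are assumptions made for illustration and are not taken from the Spark source.

```scala
object FirstByteLookupSketch {
  // Hypothetical table: one entry per lead byte 0xC0..0xFD (62 entries),
  // giving the assumed length in bytes of a sequence starting with that byte.
  val table: Array[Int] =
    Array.fill(32)(2) ++ Array.fill(16)(3) ++ Array.fill(8)(4) ++
      Array.fill(4)(5) ++ Array.fill(2)(6)

  // Assumed index computation: unsigned byte value minus 192 (0xC0).
  def offsetFor(b: Byte): Int = (b & 0xFF) - 192

  // Unchecked lookup: throws ArrayIndexOutOfBoundsException for 0xFE and 0xFF,
  // because their offsets (62 and 63) lie past the end of the table.
  def uncheckedNumBytes(b: Byte): Int = {
    val offset = offsetFor(b)
    if (offset >= 0) table(offset) else 1
  }

  // Bounds-checked variant: treats any byte outside the table as one byte wide.
  def checkedNumBytes(b: Byte): Int = {
    val offset = offsetFor(b)
    if (offset >= 0 && offset < table.length) table(offset) else 1
  }
}
```

In this sketch `FirstByteLookupSketch.offsetFor(0xFF.toByte)` evaluates to 63, which matches the index in the exception. Whether the actual fix takes the form of such a bounds check is not stated in this issue.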

Here is the content of the file:

hexdump -C ~/tmp/utf8xFF.csv
00000000 63 68 61 6e 6e 65 6c 2c 63 6f 64 65 0d 0a 55 6e |channel,code..Un|
00000010 69 74 65 64 2c 31 32 33 0d 0a 41 42 47 55 4e ff |ited,123..ABGUN.|
00000020 2c 34 35 36 0d |,456.|
00000025
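
For anyone without the attachment, the 37 bytes in the hexdump above can be written back out with a few lines of Scala. This is a minimal sketch: the object name `MakeUtf8xFF` and its `write` helper are hypothetical, and the byte sequence is copied from the dump.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

object MakeUtf8xFF {
  // Bytes as shown by hexdump: CRLF line endings, a lone 0xFF after "ABGUN",
  // and a trailing CR with no final LF.
  val bytes: Array[Byte] =
    "channel,code\r\nUnited,123\r\nABGUN".getBytes(StandardCharsets.US_ASCII) ++
      Array(0xFF.toByte) ++
      ",456\r".getBytes(StandardCharsets.US_ASCII)

  // Writes the bytes to the given path and returns that path.
  def write(path: String): Path = Files.write(Paths.get(path), bytes)
}
```

Calling `MakeUtf8xFF.write("utf8xFF.csv")` from spark-shell should produce a file suitable for the `spark.read` calls in this report.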

Schema inference does not fail in multiline mode:

spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
+-------+-----+
|channel|code
+-------+-----+
| United| 123
| ABGUN�| 456
+-------+-----+

Spark is also able to read the CSV file if the schema is specified explicitly:

import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
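
The � in the outputs above is consistent with how the JVM decodes malformed UTF-8: the byte 0xFF can never appear in well-formed UTF-8, and `new String(bytes, UTF_8)` substitutes U+FFFD for it. The sketch below shows that behaviour on the bytes of the affected field. It uses only the JDK and makes no claim about which code path Spark takes when rendering the value; the object name is hypothetical.

```scala
import java.nio.charset.StandardCharsets

object ReplacementCharSketch {
  // The first field of the last record: "ABGUN" followed by a lone 0xFF byte.
  val raw: Array[Byte] =
    "ABGUN".getBytes(StandardCharsets.US_ASCII) :+ 0xFF.toByte

  // Decoding as UTF-8 replaces the malformed byte with U+FFFD.
  val decoded: String = new String(raw, StandardCharsets.UTF_8)
}
```

Here `ReplacementCharSketch.decoded` equals `"ABGUN\uFFFD"`, which matches the `ABGUN�` shown in the tables.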

Attachments

  1. utf8xFF.csv
    0.0 kB
    Max Gekk

People

Unassigned
Max Gekk (maxgekk)
Herman van Hövell