Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.2.2, 2.3.1, 2.4.0
Component/s: SQL
Labels: None

Description

Schema inference for CSV files fails if the file contains a character whose first byte is 0xFF.

spark.read.option("header", "true").csv("utf8xFF.csv")

java.lang.ArrayIndexOutOfBoundsException: 63
 at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
 at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
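
The index in the exception message matches simple arithmetic on the offending byte: 0xFF - 0xC0 = 255 - 192 = 63. The sketch below illustrates how such a failure can arise, assuming the byte-length lookup is a table indexed by "lead byte minus 0xC0" with 62 entries (lead bytes 0xC0 to 0xFD). The table and the function names are hypothetical and are not copied from the Spark source; the guarded variant is one possible fix, not necessarily the one in the linked pull request.

```scala
// Hypothetical lookup table: sequence length for lead bytes 0xC0..0xFD (62 entries).
val lengthByLeadByte: Array[Int] =
  Array.fill(32)(2) ++ Array.fill(16)(3) ++ Array.fill(8)(4) ++
    Array.fill(4)(5) ++ Array.fill(2)(6)

// Unguarded lookup: assumes every byte >= 0xC0 has a table entry.
def numBytesUnchecked(b: Byte): Int = {
  val offset = (b & 0xFF) - 192
  if (offset >= 0) lengthByLeadByte(offset) else 1
}

// Guarded lookup: bytes outside the table (0xFE, 0xFF) are treated as
// one-byte sequences instead of indexing past the end of the array.
def numBytesChecked(b: Byte): Int = {
  val offset = (b & 0xFF) - 192
  if (offset >= 0 && offset < lengthByLeadByte.length) lengthByLeadByte(offset) else 1
}
```

With this table, `numBytesUnchecked(0xFF.toByte)` indexes position 63 of a 62-element array and throws, while `numBytesChecked(0xFF.toByte)` falls back to a length of 1.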

Here is the content of the file:

hexdump -C ~/tmp/utf8xFF.csv
00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
00000020  2c 34 35 36 0d                                    |,456.|
00000025
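
For anyone without the attachment, the 37 bytes shown in the hexdump can be written out directly. This is a minimal sketch: the temporary-file location and the variable names are assumptions, and only the byte values come from the dump above.

```scala
import java.nio.charset.StandardCharsets.US_ASCII
import java.nio.file.Files

// Header and first row end with CR LF; the last row ends with a bare CR,
// exactly as in the hexdump.
val headerBytes = "channel,code\r\n".getBytes(US_ASCII)
val row1Bytes   = "United,123\r\n".getBytes(US_ASCII)
// "ABGUN" followed by the raw byte 0xFF, then ",456" and CR.
val row2Bytes   = "ABGUN".getBytes(US_ASCII) ++ Array(0xFF.toByte) ++ ",456\r".getBytes(US_ASCII)

val csvBytes: Array[Byte] = headerBytes ++ row1Bytes ++ row2Bytes

// Hypothetical location: a temp file, so nothing in the working directory is overwritten.
val csvPath = Files.createTempFile("utf8xFF", ".csv")
Files.write(csvPath, csvBytes)
```

The resulting file can then be passed to the reader, for example `spark.read.option("header", "true").csv(csvPath.toString)`.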

Schema inference does not fail in multiline mode:

spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")

+-------+-----+
|channel|code
+-------+-----+
| United| 123
| ABGUN�| 456
+-------+-----+

Spark is also able to read the CSV file if the schema is specified explicitly:

import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show

+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
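
The replacement character in the `ABGUN�` value is what lenient UTF-8 decoding produces for a byte that cannot start a valid sequence. The sketch below shows this with plain JDK classes, independent of Spark; the helper name `isValidUtf8` is hypothetical and is only meant as one way to detect such files before reading them.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// The raw bytes of the problematic field: "ABGUN" followed by 0xFF.
val fieldBytes: Array[Byte] = "ABGUN".getBytes(StandardCharsets.US_ASCII) :+ 0xFF.toByte

// Lenient decoding: malformed input is replaced with U+FFFD.
val lenient = new String(fieldBytes, StandardCharsets.UTF_8)

// Strict decoding: malformed input is reported as an exception instead.
def isValidUtf8(bytes: Array[Byte]): Boolean = {
  val decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT)
  try { decoder.decode(ByteBuffer.wrap(bytes)); true }
  catch { case _: CharacterCodingException => false }
}
```

Here `lenient` equals `"ABGUN\uFFFD"`, and `isValidUtf8(fieldBytes)` returns `false` because 0xFF never occurs in well-formed UTF-8.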

Attachments

utf8xFF.csv
11/Mar/18 17:25
0.0 kB
Max Gekk

Issue Links

links to: [Github] Pull Request #20796 (MaxGekk)

People

Assignee: Unassigned

Reporter: Max Gekk

Shepherd: Herman van Hövell

Dates

Created: 11/Mar/18 17:24

Updated: 15/Feb/26 20:13

Resolved: 31/May/18 03:42

URL: https://issues.apache.org/jira/browse/SPARK-23649

[SPARK-23649] CSV schema inferring fails on some UTF-8 chars - ASF Jira
