
Exceptions When Reading Tutorial CSV File In The Cloudera VM

I'm trying to do a Spark tutorial that comes with the Cloudera Virtual Machine. But even though I'm using the correct line-ending encoding, I cannot execute the scripts, because I

Solution 1:

Summary of the discussion: Executing the following command solved the issue:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

see https://www.coursera.org/learn/bigdata-analytics/supplement/tyH3p/setup-pyspark-for-dataframes for more info.


Solution 2:

It seems there were two problems. First, the hive-metastore was offline on some occasions. Second, the schema could not be inferred, so I created a schema manually and passed it as an argument when loading the CSV file. That said, I would still love to understand whether this works somehow with inferSchema=true.

Here's my version with a manually defined schema. First, make sure the Hive metastore is running:

sudo service hive-metastore restart

Then have a look at the first part of the CSV file to understand its structure. I used this command:

head /usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv
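If you prefer inspecting the file from Python, a small sketch with the standard csv module shows how to pull out the header and the first data row. The sample text here is made up to stand in for the real file; on the VM you would open the actual path instead of the StringIO:

```python
import csv
import io

# Made-up two-line sample standing in for the first lines of index_data.csv;
# on the VM, replace this with open('/usr/lib/hue/apps/search/examples/'
# 'collections/solr_configs_yelp_demo/index_data.csv').
sample = io.StringIO(
    "business_id,cool,date,stars\n"
    "abc123,0,2012-01-01,4\n"
)

reader = csv.reader(sample)
header = next(reader)       # column names from the first line
first_row = next(reader)    # one data row, still plain strings

print(header)     # ['business_id', 'cool', 'date', 'stars']
print(first_row)  # ['abc123', '0', '2012-01-01', '4']
```

Note that every cell comes back as a string, which is exactly why the schema (or inferSchema) matters when loading the file into Spark.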

Now open the Python shell (see the original posting for how to do that) and define the schema:

from pyspark.sql.types import *
schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("cool", IntegerType(), True),
    StructField("date", StringType(), True),
    StructField("funny", IntegerType(), True),
    StructField("id", StringType(), True),
    StructField("stars", IntegerType(), True),
    StructField("text", StringType(), True),
    StructField("type", StringType(), True),
    StructField("useful", IntegerType(), True),
    StructField("user_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("full_address", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("neighborhood", StringType(), True),
    StructField("open", StringType(), True),
    StructField("review_count", IntegerType(), True),
    StructField("state", StringType(), True)])

Then load the CSV file, specifying the schema. Note that there is no need to switch to Windows line endings:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      schema=schema,
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

Then verify the result with any method executed on the dataset. I tried getting the count, which worked perfectly:

yelp_df.count()

Thanks to the help of @yaron, we could figure out how to load the CSV with inferSchema. First, you must set up the hive-metastore correctly:

sudo cp /etc/hive/conf.dist/hive-site.xml /usr/lib/spark/conf/

Then start the Python shell and DO NOT change the line endings to Windows encoding. Keep in mind that this setting is persistent across sessions, so if you changed it to Windows style before, you need to reset it to '\n'. Then load the CSV file with inferSchema set to true:

yelp_df = sqlCtx.load(source='com.databricks.spark.csv',
                      header='true',
                      inferSchema='true',
                      path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
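To get an intuition for why inferSchema can work at all, here is a rough, simplified sketch of the kind of type promotion a CSV reader performs while scanning sample rows. This is plain illustrative Python, not the actual spark-csv implementation, and the helper names are made up:

```python
import csv
import io

def infer_type(value):
    """Classify a single CSV cell as 'int', 'double', or 'string'."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"

def merge(a, b):
    """Promote two candidate types to the widest one seen so far."""
    order = ["int", "double", "string"]  # later entries are more general
    return a if order.index(a) >= order.index(b) else b

def infer_schema(csv_text):
    """Infer a column-name -> type mapping from CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    types = {}
    for row in reader:
        for col, cell in zip(header, row):
            types[col] = merge(types.get(col, "int"), infer_type(cell))
    return types

sample = "stars,latitude,business_id\n4,33.5,abc\n5,34.0,def\n"
print(infer_schema(sample))
# {'stars': 'int', 'latitude': 'double', 'business_id': 'string'}
```

The real reader does essentially this over a sample of the file, which is also why a stray malformed line (such as one with Windows line endings mixed in) can push a numeric column to string or break inference entirely.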
