Goal:
This article explains how to access Azure Open Datasets from a local Spark environment.
Env:
spark-3.1.1-bin-hadoop2.7
Solution:
Microsoft Azure Open Datasets are curated and cleansed datasets - including weather, census, and holiday data - that you can use with minimal preparation to enrich ML models.
To access them from a local Spark environment, we need two jars:
- azure-storage-<version>.jar
- hadoop-azure-<version>.jar
My Spark is built on Hadoop 2.7, so I have to use a correspondingly older hadoop-azure jar. In this example, I downloaded the two jars listed above.
1. Add the two jars above to the Spark classpath via:
spark.executor.extraClassPath
spark.driver.extraClassPath
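One way to wire up step 1 is through conf/spark-defaults.conf. The paths below are illustrative only; point them at wherever you placed the downloaded jars, and keep the jar file names matching the versions you actually downloaded:

```
# conf/spark-defaults.conf -- example paths, adjust to your environment
spark.driver.extraClassPath   /opt/spark/extra-jars/hadoop-azure-<version>.jar:/opt/spark/extra-jars/azure-storage-<version>.jar
spark.executor.extraClassPath /opt/spark/extra-jars/hadoop-azure-<version>.jar:/opt/spark/extra-jars/azure-storage-<version>.jar
```

Alternatively, the same two properties can be passed on the command line with --conf when launching pyspark or spark-submit.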
2. Add the Azure Blob Storage-related Hadoop configs.
For example, I chose to add them directly in a Jupyter notebook (you can also add them to core-site.xml):
sc._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.Wasbs")
sc._jsc.hadoopConfiguration().set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
Note: keys set through hadoopConfiguration() must not carry the "spark.hadoop." prefix; that prefix is only used when the same settings are supplied as Spark configuration (e.g. in spark-defaults.conf). The fs.adl.impl entry is only needed if you also read adl:// (Azure Data Lake Gen1) paths.
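As mentioned above, the same settings can instead live in Hadoop's core-site.xml. A sketch of the equivalent entries (property names as in the notebook snippet; this is a fragment, to be merged into your existing core-site.xml):

```
<property>
  <name>fs.azure</name>
  <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
</property>
<property>
  <name>fs.wasbs.impl</name>
  <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.wasbs.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasbs</value>
</property>
```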
3. Use PySpark to access an Azure Open Dataset.
For example, the "NYC Taxi - Yellow" Azure Open Dataset can be read directly from its public blob container over wasbs://.
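A minimal sketch for step 3. The storage account, container, and path below follow Microsoft's published details for the "NYC Taxi - Yellow" Open Dataset, but verify them against the dataset page before relying on them:

```python
# Public location of the "NYC Taxi - Yellow" Open Dataset (per Microsoft's
# Open Datasets documentation -- verify before use).
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "yellow"

wasbs_path = (
    f"wasbs://{blob_container_name}@{blob_account_name}"
    f".blob.core.windows.net/{blob_relative_path}"
)

def read_yellow_taxi(spark):
    """Load the dataset as a DataFrame; `spark` is an active SparkSession."""
    return spark.read.parquet(wasbs_path)
```

With the jars and configs from steps 1 and 2 in place, `df = read_yellow_taxi(spark)` followed by `df.show(10)` should print the first rows of the dataset. The container is public, so no account key is required.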