MongoDB query filters using Stratio's Spark-MongoDB library
I'm trying to query a MongoDB collection using Stratio's Spark-MongoDB library. I followed this thread to get started, and I'm running the following piece of code:
    reader = sqlContext.read.format("com.stratio.datasource.mongodb")
    data = reader.options(host='<ip>:27017', database='<db>', collection='<col>').load()
This loads the whole collection into a Spark DataFrame, and since the collection is large, it takes a lot of time. Is there a way to specify query filters and load only the selected data into Spark?
Spark DataFrame processing requires schema knowledge. When working with data sources whose schema is flexible and/or unknown, Spark has to discover the schema before it can do anything with the data. That is what load() does: it looks at the data only for the purpose of discovering its schema. Only when you perform an action on the data, e.g., collect(), does Spark actually read the data for processing.
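To make the two phases concrete, here is a minimal PySpark sketch based on the code in the question (the host, database, and collection names are placeholders) showing where schema discovery and data reading happen:

    # Schema discovery: load() inspects the collection to infer a schema,
    # which is why it can be slow on a large collection.
    reader = sqlContext.read.format("com.stratio.datasource.mongodb")
    df = reader.options(host='<ip>:27017', database='<db>', collection='<col>').load()

    df.printSchema()   # prints the schema discovered during load()

    # Data reading: only an action such as collect() actually pulls the
    # documents into Spark for processing.
    rows = df.collect()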
There is one way to radically speed up load(), and that's providing the schema yourself, obviating the need for schema discovery. Here is an example taken from the library documentation:
    import org.apache.spark.sql.types._

    val schemaMongo = StructType(
      StructField("name", StringType, true) ::
      StructField("age", IntegerType, true) :: Nil)

    val df = sqlContext.read
      .schema(schemaMongo)
      .format("com.stratio.datasource.mongodb")
      .options(Map("host" -> "localhost:27017", "database" -> "highschool", "collection" -> "students"))
      .load
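Since the question uses PySpark, here is a rough PySpark equivalent of the same idea, a sketch assuming the same placeholder host, database, and collection names as in the question:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Build the schema up front so load() can skip schema discovery entirely.
    schemaMongo = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])

    df = (sqlContext.read
          .schema(schemaMongo)
          .format("com.stratio.datasource.mongodb")
          .options(host='<ip>:27017', database='<db>', collection='<col>')
          .load())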
You can also get a slight gain by sampling only a fraction of the documents in the collection, by setting the schema_samplingRatio configuration parameter to a value less than the 1.0 default. However, since Mongo doesn't have sampling built in, you'll still end up accessing a potentially large amount of data.
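For completeness, a hedged sketch of passing the sampling ratio as a read option in PySpark; the exact option key and accepted values are whatever the library documents (schema_samplingRatio is assumed here, with the question's placeholder connection details):

    # Sample roughly 10% of the documents during schema inference.
    # Discovery gets cheaper, but the inferred schema may miss rare fields.
    reader = sqlContext.read.format("com.stratio.datasource.mongodb")
    df = reader.options(host='<ip>:27017', database='<db>', collection='<col>',
                        schema_samplingRatio='0.1').load()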