MongoDB query filters using Stratio's Spark-MongoDB library


I'm trying to query a MongoDB collection using Stratio's Spark-MongoDB library. I followed this thread to get started, and I'm running the following piece of code:

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='<ip>:27017', database='<db>', collection='<col>').load()

This loads the whole collection into a Spark DataFrame, and since the collection is large, it's taking a lot of time. Is there a way to specify query filters and load only the selected data into Spark?

Spark DataFrame processing requires schema knowledge. When working with data sources that have a flexible and/or unknown schema, Spark has to discover the schema before it can do anything with the data. That is what load() does: it looks at the data only for the purpose of discovering its schema. Only when you perform an action on the data, e.g., collect(), will Spark actually read the data for processing purposes.
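To make that concrete in the asker's pyspark setup, here is a minimal sketch (reusing the placeholder host/database/collection values from the question) showing where schema discovery happens versus where the data is actually read:

# load() triggers schema discovery only; no processing read happens yet
df = sqlContext.read.format("com.stratio.datasource.mongodb") \
    .options(host='<ip>:27017', database='<db>', collection='<col>') \
    .load()

df.printSchema()     # reports the schema discovered during load()

rows = df.collect()  # an action: only now is the data read for processing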

There is one way to radically speed up load(), and that's by providing the schema yourself, thus obviating the need for schema discovery. Here is an example taken from the library documentation:

import org.apache.spark.sql.types._

val schemaMongo = StructType(
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val df = sqlContext.read.schema(schemaMongo)
  .format("com.stratio.datasource.mongodb")
  .options(Map("host" -> "localhost:27017", "database" -> "highschool", "collection" -> "students"))
  .load
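Since you are working from pyspark, the same idea should carry over through the Python DataFrame reader API. A sketch, assuming the host/database/collection values from the documentation example above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Provide the schema up front so load() can skip schema discovery
schemaMongo = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

df = sqlContext.read.schema(schemaMongo) \
    .format("com.stratio.datasource.mongodb") \
    .options(host='localhost:27017', database='highschool', collection='students') \
    .load()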

You can also get a slight gain by sampling only a fraction of the documents in the collection, by setting the schema_samplingRatio configuration parameter to a value less than the 1.0 default. However, since Mongo doesn't have sampling built in, you'll still be accessing a potentially large amount of data.
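For example, a sketch of passing the sampling ratio the same way as the other reader options (supplying it as a string keyword option here is an assumption; check the library's configuration documentation for the exact spelling it expects):

# Sample roughly 10% of documents during schema discovery
df = sqlContext.read.format("com.stratio.datasource.mongodb") \
    .options(host='<ip>:27017', database='<db>', collection='<col>',
             schema_samplingRatio='0.1') \
    .load()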

