MongoDB query filters using Stratio's Spark-MongoDB library
I'm trying to query a MongoDB collection using Stratio's Spark-MongoDB library. I followed this thread to get started, and I'm running the following piece of code:
    reader = sqlContext.read.format("com.stratio.datasource.mongodb")
    data = reader.options(host='<ip>:27017', database='<db>', collection='<col>').load()
This loads the whole collection into a Spark DataFrame, and since the collection is large, it takes a lot of time. Is there a way to specify query filters and load only the selected data into Spark?
Spark DataFrame processing requires schema knowledge. When working with data sources whose schema is flexible and/or unknown, Spark has to discover the schema before it can do anything with the data. That is what load() does: it looks at the data only for the purpose of discovering its schema. Only when you perform an action on the data, e.g., collect(), does Spark actually read the data for processing.
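To make the two phases concrete, here is a minimal PySpark sketch based on the code in the question (the host, database, and collection names are placeholders) showing where schema discovery and data reading happen:

    # Schema discovery: load() inspects the collection to infer a schema,
    # which is why it can be slow on a large collection.
    reader = sqlContext.read.format("com.stratio.datasource.mongodb")
    df = reader.options(host='<ip>:27017', database='<db>', collection='<col>').load()

    df.printSchema()   # prints the schema discovered during load()

    # Data reading: only an action such as collect() actually pulls the
    # documents into Spark for processing.
    rows = df.collect()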
There is one way to radically speed up load(), and that's providing the schema yourself, obviating the need for schema discovery. Here is an example taken from the library documentation:
    import org.apache.spark.sql.types._

    val schemaMongo = StructType(
      StructField("name", StringType, true) ::
      StructField("age", IntegerType, true) :: Nil)

    val df = sqlContext.read
      .schema(schemaMongo)
      .format("com.stratio.datasource.mongodb")
      .options(Map("host" -> "localhost:27017", "database" -> "highschool", "collection" -> "students"))
      .load
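Since the question uses PySpark, here is a rough PySpark equivalent of the same idea, a sketch assuming the same placeholder host, database, and collection names as in the question:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Build the schema up front so load() can skip schema discovery entirely.
    schemaMongo = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])

    df = (sqlContext.read
          .schema(schemaMongo)
          .format("com.stratio.datasource.mongodb")
          .options(host='<ip>:27017', database='<db>', collection='<col>')
          .load())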
You can also get a slight gain by sampling only a fraction of the documents in the collection, by setting the schema_samplingRatio configuration parameter to a value less than the 1.0 default. However, since Mongo doesn't have sampling built in, you'll still end up accessing a potentially large amount of data.
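For completeness, a hedged sketch of passing the sampling ratio as a read option in PySpark; the exact option key and accepted values are whatever the library documents (schema_samplingRatio is assumed here, with the question's placeholder connection details):

    # Sample roughly 10% of the documents during schema inference.
    # Discovery gets cheaper, but the inferred schema may miss rare fields.
    reader = sqlContext.read.format("com.stratio.datasource.mongodb")
    df = reader.options(host='<ip>:27017', database='<db>', collection='<col>',
                        schema_samplingRatio='0.1').load()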