pyspark.sql.DataFrame

class pyspark.sql.DataFrame(jdf: py4j.java_gateway.JavaObject, sql_ctx: Union[SQLContext, SparkSession])

A distributed collection of data grouped into named columns.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Notes

A DataFrame should only be created as described above. It should not be created directly via the constructor.
Examples

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

>>> people = spark.createDataFrame([
...     {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
...     {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
...     {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
...     {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
... ])
Once created, it can be manipulated using the various domain-specific language (DSL) functions defined in: DataFrame, Column.

To select a column from the DataFrame, use the apply method:

>>> age_col = people.age
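Columns can also be referenced with bracket notation, or as unbound expressions via col() from pyspark.sql.functions (useful when the column name is not a valid attribute name). A brief sketch; the variable name age_col is illustrative:

>>> from pyspark.sql.functions import col
>>> age_col = people["age"]  # bracket notation; works for any column name
>>> age_col = col("age")     # unbound column expression, resolved when used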
A more concrete example:
>>> # To create DataFrame using SparkSession
... department = spark.createDataFrame([
...     {"id": 1, "name": "PySpark"},
...     {"id": 2, "name": "ML"},
...     {"id": 3, "name": "Spark SQL"}
... ])
>>> people.filter(people.age > 30).join(
...     department, people.deptId == department.id).groupBy(
...     department.name, "gender").agg({"salary": "avg", "age": "max"}).show()
+-------+------+-----------+--------+
|   name|gender|avg(salary)|max(age)|
+-------+------+-----------+--------+
|     ML|     F|      150.0|      60|
|PySpark|     M|       75.0|      50|
+-------+------+-----------+--------+
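The dict form of agg() allows one aggregate per column. An equivalent sketch using explicit aggregate functions from pyspark.sql.functions, which also permits multiple aggregates per column and aliasing:

>>> from pyspark.sql import functions as F
>>> people.filter(people.age > 30).join(
...     department, people.deptId == department.id).groupBy(
...     department.name, "gender").agg(
...     F.avg("salary"), F.max("age")).show()
+-------+------+-----------+--------+
|   name|gender|avg(salary)|max(age)|
+-------+------+-----------+--------+
|     ML|     F|      150.0|      60|
|PySpark|     M|       75.0|      50|
+-------+------+-----------+--------+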
Methods
agg(*exprs): Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
alias(alias): Returns a new DataFrame with an alias set.
approxQuantile(col, probabilities, relativeError): Calculates the approximate quantiles of numerical columns of a DataFrame.
cache(): Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER).
checkpoint([eager]): Returns a checkpointed version of this DataFrame.
coalesce(numPartitions): Returns a new DataFrame that has exactly numPartitions partitions.
colRegex(colName): Selects column based on the column name specified as a regex and returns it as Column.
collect(): Returns all the records as a list of Row.
corr(col1, col2[, method]): Calculates the correlation of two columns of a DataFrame as a double value.
count(): Returns the number of rows in this DataFrame.
cov(col1, col2): Calculate the sample covariance for the given columns, specified by their names, as a double value.
createGlobalTempView(name): Creates a global temporary view with this DataFrame.
createOrReplaceGlobalTempView(name): Creates or replaces a global temporary view using the given name.
createOrReplaceTempView(name): Creates or replaces a local temporary view with this DataFrame.
createTempView(name): Creates a local temporary view with this DataFrame.
crossJoin(other): Returns the cartesian product with another DataFrame.
crosstab(col1, col2): Computes a pair-wise frequency table of the given columns.
cube(*cols): Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
describe(*cols): Computes basic statistics for numeric and string columns.
distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame.
drop(*cols): Returns a new DataFrame without specified columns.
dropDuplicates([subset]): Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
dropDuplicatesWithinWatermark([subset]): Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within watermark.
drop_duplicates([subset]): drop_duplicates() is an alias for dropDuplicates().
dropna([how, thresh, subset]): Returns a new DataFrame omitting rows with null values.
exceptAll(other): Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
explain([extended, mode]): Prints the (logical and physical) plans to the console for debugging purposes.
fillna(value[, subset]): Replace null values, alias for na.fill().
filter(condition): Filters rows using the given condition.
first(): Returns the first row as a Row.
foreach(f): Applies the f function to all Row of this DataFrame.
foreachPartition(f): Applies the f function to each partition of this DataFrame.
freqItems(cols[, support]): Finding frequent items for columns, possibly with false positives.
groupBy(*cols): Groups the DataFrame using the specified columns, so we can run aggregation on them.
groupby(*cols): groupby() is an alias for groupBy().
head([n]): Returns the first n rows.
hint(name, *parameters): Specifies some hint on the current DataFrame.
inputFiles(): Returns a best-effort snapshot of the files that compose this DataFrame.
intersect(other): Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.
intersectAll(other): Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
isEmpty(): Checks if the DataFrame is empty and returns a boolean value.
isLocal(): Returns True if the collect() and take() methods can be run locally (without any Spark executors).
join(other[, on, how]): Joins with another DataFrame, using the given join expression.
limit(num): Limits the result count to the number specified.
localCheckpoint([eager]): Returns a locally checkpointed version of this DataFrame.
mapInArrow(func, schema[, barrier]): Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow's RecordBatch, and returns the result as a DataFrame.
mapInPandas(func, schema[, barrier]): Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
melt(ids, values, variableColumnName, …): Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
observe(observation, *exprs): Define (named) metrics to observe on the DataFrame.
offset(num): Returns a new DataFrame by skipping the first n rows.
orderBy(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
pandas_api([index_col]): Converts the existing DataFrame into a pandas-on-Spark DataFrame.
persist([storageLevel]): Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
printSchema([level]): Prints out the schema in the tree format.
randomSplit(weights[, seed]): Randomly splits this DataFrame with the provided weights.
registerTempTable(name): Registers this DataFrame as a temporary table using the given name.
repartition(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
repartitionByRange(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
replace(to_replace[, value, subset]): Returns a new DataFrame replacing a value with another value.
rollup(*cols): Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
sameSemantics(other): Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.
sample([withReplacement, fraction, seed]): Returns a sampled subset of this DataFrame.
sampleBy(col, fractions[, seed]): Returns a stratified sample without replacement based on the fraction given on each stratum.
select(*cols): Projects a set of expressions and returns a new DataFrame.
selectExpr(*expr): Projects a set of SQL expressions and returns a new DataFrame.
semanticHash(): Returns a hash code of the logical query plan against this DataFrame.
show([n, truncate, vertical]): Prints the first n rows to the console.
sort(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
sortWithinPartitions(*cols, **kwargs): Returns a new DataFrame with each partition sorted by the specified column(s).
subtract(other): Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.
summary(*statistics): Computes specified statistics for numeric and string columns.
tail(num): Returns the last num rows as a list of Row.
take(num): Returns the first num rows as a list of Row.
to(schema): Returns a new DataFrame where each row is reconciled to match the specified schema.
toDF(*cols): Returns a new DataFrame with new specified column names.
toJSON([use_unicode]): Converts a DataFrame into an RDD of string.
toLocalIterator([prefetchPartitions]): Returns an iterator that contains all of the rows in this DataFrame.
toPandas(): Returns the contents of this DataFrame as a pandas.DataFrame.
to_koalas([index_col]): Deprecated alias; use pandas_api() instead.
to_pandas_on_spark([index_col]): Deprecated alias; use pandas_api() instead.
transform(func, *args, **kwargs): Returns a new DataFrame.
union(other): Return a new DataFrame containing the union of rows in this and another DataFrame.
unionAll(other): Return a new DataFrame containing the union of rows in this and another DataFrame.
unionByName(other[, allowMissingColumns]): Returns a new DataFrame containing union of rows in this and another DataFrame.
unpersist([blocking]): Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
unpivot(ids, values, variableColumnName, …): Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
where(condition): where() is an alias for filter().
withColumn(colName, col): Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
withColumnRenamed(existing, new): Returns a new DataFrame by renaming an existing column.
withColumns(*colsMap): Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.
withColumnsRenamed(colsMap): Returns a new DataFrame by renaming multiple columns.
withMetadata(columnName, metadata): Returns a new DataFrame by updating an existing column with metadata.
withWatermark(eventTime, delayThreshold): Defines an event time watermark for this DataFrame.
writeTo(table): Create a write configuration builder for v2 sources.
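Most of the methods above are lazy transformations that return a new DataFrame; nothing is computed until an action such as show(), collect(), or count() runs. A minimal sketch chaining a few of them on the people DataFrame from the examples (the variable names are illustrative):

>>> from pyspark.sql import functions as F
>>> result = (people
...     .dropDuplicates(["name"])                        # keep one row per name
...     .withColumn("salary_k", F.col("salary") / 1000)  # add a derived column
...     .orderBy(F.desc("age")))                         # sort; still lazy
>>> result.count()  # action: triggers the computation
4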
Attributes
columns: Retrieves the names of all columns in the DataFrame as a list.
dtypes: Returns all column names and their data types as a list.
isStreaming: Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
na: Returns a DataFrameNaFunctions for handling missing values.
rdd: Returns the content as a pyspark.RDD of Row.
schema: Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
sparkSession: Returns the Spark session that created this DataFrame.
sql_ctx
stat: Returns a DataFrameStatFunctions for statistic functions.
storageLevel: Get the DataFrame's current storage level.
write: Interface for saving the content of the non-streaming DataFrame out into external storage.
writeStream: Interface for saving the content of the streaming DataFrame out into external storage.
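Attributes are read without parentheses. A short sketch, assuming the people DataFrame from the examples above (the variable names are illustrative):

>>> people.isStreaming      # created from local data, so not a streaming source
False
>>> names = people.columns  # list of column name strings
>>> pairs = people.dtypes   # list of (column name, type name) tuples
>>> cleaned = people.na.fill({"salary": 0})  # na is the DataFrameNaFunctions entry point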