Matthew Powers (MrPowers) | Medivo | New York | http://www.codequizzes.com/ | Data engineer at prognos.ai. Likes Scala, Spark, Ruby, data, and math.

holdenk/spark-testing-base 1181

Base classes to use when writing tests with Spark

MrPowers/code_quizzer 192

Programming practice questions with Ruby, JavaScript, Rails, and Bash.

MrPowers/chispa 24

PySpark test helper methods with beautiful error messages

jasonsatran/spark-meta 17

Spark data profiling utilities

lizparody/awesome-programming-workout 7

This is the workout I will do to become a brilliant programmer and work in a U.S company

MrPowers/ceja 6

PySpark phonetic and string matching algorithms

MrPowers/awesome-spark 5

A curated list of awesome Apache Spark packages and resources.

push event MrPowers/spark-frameless

MrPowers

commit sha 8056645a2a36ecff36abaab43d2214296ea08221

Explore basic features of typed datasets

view details

push time in a day

issue opened typelevel/frameless

Easier withColumn method

Great work on this lib! It's a great way to write Spark code!

As discussed here and in the docs, withColumn requires a full schema when a column is added.

Here's the example in the docs:

case class CityBedsOther(city: String, bedrooms: Int, other: List[String])

cityBeds.
   withColumn[CityBedsOther](lit(List("a","b","c"))).
   show(1).run()

Couldn't we just assume that the schema stays the same for the existing columns and only supply the schema for the column that's being added?

cityBeds.
   withColumn[List[String]](lit(List("a","b","c"))).
   show(1).run()

I think this'd be a lot more user-friendly. I'm often dealing with schemas that have tons of columns, and I add lots of columns with withColumn. Let me know your thoughts!

created time in a day

push event MrPowers/cali

MrPowers

commit sha 70f5125f062d72f16af9872447d1955cc860f26a

Add Java, Scala, IntelliJ, and Terminal setup instructions

view details

push time in 5 days

pull request comment MrPowers/spark-fast-tests

fix: show missing content in case datasets don't match

Thanks for the PR! Let me know if you have any other suggestions for improvements. Open an issue or submit a PR anytime.

cchepelov

comment created time in 5 days

push event MrPowers/spark-fast-tests

Cyrille Chépélov

commit sha a392dea9cb4fc7532c9cd4a94fee6de3fc5a3087

fix: show missing content in case datasets don't match (#82)

view details

push time in 5 days

PR merged MrPowers/spark-fast-tests

fix: show missing content in case datasets don't match

Contrary to the method that shows schema mismatches, betterContentMismatchMessages can report 'there are diffs' without showing the actual difference when the difference is data missing from either side.

This patch adds tests, and switches one ".zip" to a ".zipAll" to fix this.
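The zip-versus-zipAll distinction at the heart of the patch can be demonstrated with plain Scala collections (a standalone sketch; the names here are illustrative, not taken from spark-fast-tests):

```scala
object ZipVsZipAll extends App {
  // Pretend these are the collected rows of two datasets under comparison.
  val expected = Seq("alice", "bob", "carol")
  val actual   = Seq("alice", "bob")

  // zip truncates to the shorter side, so the extra "carol" row
  // silently disappears from the comparison.
  println(expected.zip(actual))
  // List((alice,alice), (bob,bob))

  // zipAll pads the shorter side with a placeholder, so the missing
  // row is visible in the mismatch output.
  println(expected.zipAll(actual, "<missing>", "<missing>"))
  // List((alice,alice), (bob,bob), (carol,<missing>))
}
```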

+74 -1

0 comment

2 changed files

cchepelov

pr closed time in 5 days

push event MrPowers/cali

MrPowers

commit sha 4268243be7fa51836f6787dcbf03fca7b0779def

Add Java, Scala, SBT, Spark installation guide

view details

push time in 5 days

push event MrPowers/cali

MrPowers

commit sha caf3010eef10b7e272778cb24a483e44a560e885

Add VSCode setup instructions

view details

push time in 6 days

push event MrPowers/cali

MrPowers

commit sha 18d0c950f62c9c61e3aa7fb8d663feac7fac8496

Guides now use zsh instead of bash

view details

push time in 6 days

push event MrPowers/cali

MrPowers

commit sha 51104e3839434a6fa4820538a808862054e9bfa3

Update Python installation guide

view details

push time in 6 days

push event MrPowers/cali

Powers

commit sha 9d8c9db83a7dcd368bbeff22ea05902fa17973e0

Add Git and GitHub setup instructions

view details

push time in 6 days

push event MrPowers/cali

Matthew Powers

commit sha dac29378e3f0e4d5de6f9cbe12f753d289559423

Add GitHub setup instructions

view details

push time in 6 days

started MrPowers/spark-frameless

started time in 7 days

create branch MrPowers/spark-frameless

branch : main

created branch time in 7 days

created repository MrPowers/spark-frameless

Typed Datasets with Spark

created time in 7 days

started imarios/frameless.g8

started time in 8 days

started typelevel/frameless

started time in 8 days

push event MrPowers/delta-examples

MrPowers

commit sha 36f230d6acbd2e47f94c34389ef8e457561f98be

bump Spark / library versions

view details

push time in 8 days

push event MrPowers/spark-records

MrPowers

commit sha 5b65aaeec23a613fccc74cabc58d1f831af91ed9

Add resolver to fetch the spark-test-sugar dependency

view details

Simeon Simeonov

commit sha b9e65b900b8b0af26d0717d0fbc1c689c9948123

Merge pull request #7 from MrPowers/second-attempt-fix-build Add resolver to fetch the spark-test-sugar dependency

view details

MrPowers

commit sha 412fe5125b61b00b490e01f249ae4af1ce8b714b

Remove spark-test-sugar dependency

view details

MrPowers

commit sha 57aa9b727662df0f776e62cbb757f5efd01f7830

Use 2 cores when running tests

view details

Simeon Simeonov

commit sha 5ea2935e015186c284e2ed87494d744fc3e182d3

Merge pull request #9 from MrPowers/remove-test-sugar Remove spark-test-sugar dependency

view details

push time in 8 days

Pull request review comment swoop-inc/spark-records

Remove spark-test-sugar dependency

 package examples.fancy_numbers
+import com.swoop.spark.SparkSessionTestWrapper
 import com.swoop.spark.records._
-import com.swoop.spark.test.SparkSqlSpec
 import org.apache.spark.sql.Dataset
 import org.apache.spark.storage.StorageLevel

-class SparkTest extends ExampleSpec with SparkSqlSpec with TestNegative5To100 {
+class SparkTest extends ExampleSpec with SparkSessionTestWrapper with TestNegative5To100 {
+  val sc = spark.sparkContext
   lazy val dc = SimpleDriverContext(sc)
   lazy val jc = dc.jobContext(SimpleJobContext)
   lazy val ds = recordsDataset(-5 to 100, jc)
   lazy val records = ds.collect

   "in an integration test" - {
     implicit val env = FlatRecordEnvironment()
-    val sqlContext = sqlc
-    import sqlContext.implicits._
+    import spark.implicits._

     behave like fancyRecordBuilder(records, jc)

     "should build records with Spark" in {
       ds.count should be(105)
     }
+

Yep, agreed, updated!

MrPowers

comment created time in 11 days


Pull request review comment swoop-inc/spark-records

Remove spark-test-sugar dependency

+package com.swoop.spark
+
+import org.apache.spark.sql.SparkSession
+
+trait SparkSessionTestWrapper {
+
+  lazy val spark: SparkSession = {
+    SparkSession
+      .builder()
+      .master("local")
+      .appName("spark-records")
+      .config(

Copied this over from a project using scalafmt. Updated the code to put this all on one line. My Scala formatting is the worst, so feel free to format it however you like.

MrPowers

comment created time in 11 days


Pull request review comment swoop-inc/spark-records

Remove spark-test-sugar dependency

+package com.swoop.spark
+
+import org.apache.spark.sql.SparkSession
+
+trait SparkSessionTestWrapper {
+
+  lazy val spark: SparkSession = {
+    SparkSession
+      .builder()
+      .master("local")

Great point, updated to 2 cores & 4 shuffle partitions.
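For context, a test-suite SparkSession with those settings might look like the sketch below (an assumed configuration; the actual trait and values live in the PR):

```scala
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      // local[2]: run with 2 cores so concurrency issues can surface
      // without slowing the suite down
      .master("local[2]")
      .appName("spark-records")
      // small test datasets don't need the default 200 shuffle partitions
      .config("spark.sql.shuffle.partitions", "4")
      .getOrCreate()
  }
}
```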

MrPowers

comment created time in 11 days


Pull request review comment swoop-inc/spark-records

Remove spark-test-sugar dependency

+# Set everything to be logged to the console
+log4j.rootCategory=ERROR, console
+log4j.appender.console=org.apache.log4j.ConsoleAppender
+log4j.appender.console.target=System.err
+log4j.appender.console.layout=org.apache.log4j.PatternLayout
+log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
+
+# Settings to quiet third party logs that are too verbose
+log4j.logger.org.eclipse.jetty=WARN
+log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
+log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
+log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN

Fixed, good catch ;)

MrPowers

comment created time in 11 days


push event MrPowers/spark-records

MrPowers

commit sha 57aa9b727662df0f776e62cbb757f5efd01f7830

Use 2 cores when running tests

view details

push time in 11 days

create branch MrPowers/spark-records

branch : remove-test-sugar

created branch time in 12 days

push event MrPowers/cali

MrPowers

commit sha 8bd6aadd3ebe8f16d2dbf103b4308f5547d5e228

Add a Ruby setup guide

view details

push time in 14 days

started jekyll/jekyll

started time in 14 days

push event MrPowers/cali

MrPowers

commit sha 9aa71c4af5aacfc9fd6a30a8572a4c9046ae5fc1

Update Python installation guide

view details

push time in 14 days

push event MrPowers/cali

MrPowers

commit sha 9e2a493fbcb9d9edea24cf5a00ec16502eef31e2

Let Nerdtree show hidden files

view details

push time in 14 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 15966122c9fdaa3417b50516af9462382b87ffba

Add instructions on how to publish the microsite

view details

MrPowers

commit sha fc0512c9a850fbeb4ae33f56bbc3483560347325

Initial pass at microsite

view details

push time in 14 days

create branch MrPowers/spark-stringmetric

branch : configure-microsite

created branch time in 14 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 459473330b0ec193ddefc20941bdd6486ee78b3c

updated site

view details

push time in 14 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha cdc21efead30d1d703fe151d1ef0473dba2f47bf

updated site

view details

push time in 14 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha ebc3770dcd3843c2ba10ccdf91f4c121cfb376e2

updated site

view details

push time in 14 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 619e4a05a5f11b7b1cd51d885595be4ca8743ff5

updated site

view details

push time in 14 days

issue opened swoop-inc/spark-records

Fix the GitHub page for this project

The GitHub page isn't working at the moment.

It's giving a 404 "There isn't a GitHub Pages site here" error.

The console says "Failed to load resource: the server responded with a status of 404 ()".

I say we update all the plugin versions to match spark-alchemy, regenerate the GitHub Pages site, and see if that fixes the problem. Sound good?

created time in 15 days

issue closed MrPowers/spark-fast-tests

Publish to Maven Central?

I noticed that this library (which is awesome BTW!) was being published to Maven Central some time ago, but then stopped: https://search.maven.org/classic/#search|ga|1|spark-fast-tests (last release is in June 2018). Is it possible to publish newer versions to Maven Central again?

The issue is that using third-party repositories is often very hard at certain companies, which only allow a proxying artifact repository that is configured to work with Maven Central and almost nothing else, and are very reluctant to add any other repository to proxy. Thus, there is really no way to use your library: even if it works on dev machines, it won't work on CI, which is only allowed to use the internal repository.

closed time in 15 days

netvl

issue comment MrPowers/spark-fast-tests

Publish to Maven Central?

@netvl @marcostong17 - The artifacts have been published to Maven and can be accessed by adding this line to the build.sbt file:

libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.21.3" % "test"

Sorry it took so long and thanks for commenting.

netvl

comment created time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : remove_spark_packages

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : add_utest

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : remove_scalatest_dependency

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : failing_array_comparison

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : make_dataframe_comparer_generic

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : better-content-mismatch-message

delete time in 15 days

create branch MrPowers/spark-fast-tests

branch : migrate-to-scalatest

created branch time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : migrate-to-scalatest

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : make-dataframe-output-better

delete time in 15 days

delete branch MrPowers/spark-fast-tests

delete branch : fix-ci-tests

delete time in 15 days

push event MrPowers/spark-sbt.g8

MrPowers

commit sha 32f84a5ed93b84a7ef4de5ad5c457f49bdfdc6d4

Bump versions in the README

view details

push time in 15 days

push event MrPowers/spark-sbt.g8

MrPowers

commit sha e21db4755d1a33b365590da24223e054acb9a408

Add example functions

view details

push time in 15 days

push event MrPowers/spark-sbt.g8

MrPowers

commit sha d3fb5f286a6043aeee43ccb68875958833169af5

Pick a better name for the transformations, so the template is easier to use

view details

push time in 15 days

push event MrPowers/spark-sbt.g8

MrPowers

commit sha 19db35685907ced1d3f5e343e8e062444e05f833

Provide guidance on Scala versions for different Spark versions

view details

push time in 15 days

push event MrPowers/spark-sbt.g8

MrPowers

commit sha 01de69773aca5e84707e5aef0163bcbc49bf9521

Bump to latest library versions

view details

push time in 15 days

push event MrPowers/spark-daria

MrPowers

commit sha 1df453b2de099e6f61d618d02f87020314f3a92b

Add detailed publishing steps

view details

push time in 18 days

push event MrPowers/mrpowers.github.io

MrPowers

commit sha 76a02ac50ae8620a7d87e18303db71a8bb54cf5f

Remove documentation as that's now handled directly in the project repos

view details

push time in 18 days

push event MrPowers/spark-daria

MrPowers

commit sha 06f144cc0968fbb7a40b661d53c46afc36380614

Update link to documentation

view details

push time in 18 days

push event MrPowers/spark-daria

MrPowers

commit sha 1c3e71b261e66a89e3af38295f638e19dd0a51fc

updated site

view details

push time in 18 days

push event MrPowers/spark-daria

MrPowers

commit sha 2c3e64b70f10ffa39a150d70adcac377894bf80e

Fix git remote repo path

view details

push time in 18 days

delete branch MrPowers/spark-daria

delete branch : feature/higher_order_function_patch_03

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : feature/higher_order_function_patch_02

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : feature/higher_order_function_patch_01

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : feature/mill

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : remove_duplicate_code

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : eval-string

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : daria-writers

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : elt

delete time in 18 days

delete branch MrPowers/spark-daria

delete branch : new-sbt-publish-process

delete time in 18 days

create branch MrPowers/spark-daria

branch : gh-pages

created branch time in 18 days

push event MrPowers/spark-daria

MrPowers

commit sha d0495e88f493efe3b4fefdca8ac0ca47a44f0fe8

Add sbt-ghpages to publish documentation

view details

push time in 18 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 61c4cb157c57551d0c9f5c0bf669a08ebedcacd3

Add more detailed release instructions

view details

push time in 18 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 8d8dfc73764449810d0a0b33bfc8b930cfbb0fa8

Add link to latest API documentation

view details

push time in 18 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha fcbf0431a47ed63cdae94a232544655005861e0e

updated site

view details

push time in 18 days

create branch MrPowers/spark-stringmetric

branch : gh-pages

created branch time in 18 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 0683e6293a02e0ba7b891595d3edabed172f8466

Add the sbt-ghpages plugin

view details

push time in 18 days

pull request comment MrPowers/spark-fast-tests

Automate publishing of artifacts

@nightscape - Thanks to your help, I am now able to publish spark-daria and spark-fast-tests properly in Maven. Really appreciate your help.

I am going to revisit the automated CI publishing in a couple of months. I am going to try to learn more about GPG, PGP, and just really understand what's going on here.

Can you send me an email? (My email address is in my GitHub profile.) I'd like to brainstorm some spark-excel ideas with you!

nightscape

comment created time in 19 days

PR opened memsql/memsql-spark-connector

Bump spark-daria and spark-fast-tests versions

spark-daria and spark-fast-tests have been transitioned to a standard publishing process. They're both cross-compiled with Scala 2.11 and Scala 2.12.

This standard SBT dependency approach will make it easier for the memsql-spark-connector library to be cross-compiled with Scala 2.12.

+4 -4

0 comment

1 changed file

pr created time in 19 days

create branch MrPowers/memsql-spark-connector

branch : bump-mrpowers-dep-versions

created branch time in 19 days

started minio/spark-select

started time in 19 days

PR opened swoop-inc/spark-records

Add resolver to fetch the spark-test-sugar dependency

Thanks for building this library 😄

I was getting this error when running sbt test:

[info] Resolving com.swoop#spark-test-sugar_2.11;1.5.0 ...
[warn] 	module not found: com.swoop#spark-test-sugar_2.11;1.5.0
[warn] ==== local: tried
[warn]   /Users/matthewpowers/.ivy2/local/com.swoop/spark-test-sugar_2.11/1.5.0/ivys/ivy.xml
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/com/swoop/spark-test-sugar_2.11/1.5.0/spark-test-sugar_2.11-1.5.0.pom
[warn] ==== local-preloaded-ivy: tried
[warn]   /Users/matthewpowers/.sbt/preloaded/com.swoop/spark-test-sugar_2.11/1.5.0/ivys/ivy.xml
[warn] ==== local-preloaded: tried
[warn]   file:////Users/matthewpowers/.sbt/preloaded/com/swoop/spark-test-sugar_2.11/1.5.0/spark-test-sugar_2.11-1.5.0.pom
[warn] ==== tpolecat: tried
[warn]   http://dl.bintray.com/tpolecat/maven/com/swoop/spark-test-sugar_2.11/1.5.0/spark-test-sugar_2.11-1.5.0.pom
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: com.swoop#spark-test-sugar_2.11;1.5.0: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::

It was looking in http://dl.bintray.com/tpolecat/maven/com/swoop/spark-test-sugar_2.11 for spark-test-sugar instead of in https://dl.bintray.com/swoop-inc/maven/. Adding the resolver to the build.sbt file fixes the build on my machine. Quite the strange error message!
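For reference, the sbt resolver mechanism used here looks like this (a sketch; the resolver name "swoop-bintray" is arbitrary, and the URL is the swoop-inc Bintray repo mentioned above):

```scala
// build.sbt -- point sbt at the Bintray repo that actually hosts
// the com.swoop artifacts, since they aren't on Maven Central
resolvers += "swoop-bintray" at "https://dl.bintray.com/swoop-inc/maven/"
```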

+2 -0

0 comment

2 changed files

pr created time in 22 days

create branch MrPowers/spark-records

branch : second-attempt-fix-build

created branch time in 22 days

fork MrPowers/spark-records

Bulletproof Apache Spark jobs with fast root cause analysis of failures.

https://swoop-inc.github.io/spark-records/

fork in 22 days

started MrPowers/great-spark

started time in 22 days

create branch MrPowers/great-spark

branch : master

created branch time in 22 days

started swoop-inc/spark-records

started time in 22 days

started datastax/spark-cassandra-connector

started time in 22 days

created repository MrPowers/great-spark

Curated collection of Spark libraries and example applications

created time in 22 days

push event MrPowers/spark-fast-tests

MrPowers

commit sha b051abf4b6d4a872d3fb2458e31b748d4ab8b09a

Remove continuous deployment from GitHub Actions

view details

push time in 22 days

create branch MrPowers/spark-fast-tests

branch : fix-ci-tests

created branch time in 22 days

issue closed MrPowers/spark-daria

New release soon?

I'd love to use the new DariaWriters in a project where I need to output a single CSV file. This functionality doesn't seem to be released yet. What is the timeline for the release that will include that functionality?

closed time in 23 days

colindean

issue comment MrPowers/spark-daria

New release soon?

@colindean - thanks for opening the issue and sorry for the delayed response. Here's how to access the latest version of the lib:

libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"

If you need anything else or have any suggestions on how we can make this library better, just let me know. Thanks again for opening the issue.

colindean

comment created time in 23 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha b99890b46720ab14696a949771d80de5da99c43f

Explain where JAR files are stored

view details

push time in 23 days

push event MrPowers/spark-stringmetric

MrPowers

commit sha 5d942dbeb6bd1989f9ff83858eb282ac04f4108c

Update instructions to fetch dependency and to release the project

view details

push time in 23 days
