How to Automatically Clean up, Validate Data, and Compare Multiple Large Excel or CSV files

In this post, you’ll learn not only how to clean, validate, and compare multiple Excel/CSV files, but also how to automate this entire process with Dropbase, saving you at least 75% of the time you take to do this today, and more for recurring use-cases. To learn how, continue reading.

Dropbase is the easiest way to import, clean, and validate Excel and CSV files from customers and partners directly into databases that seamlessly integrate with your favorite data apps and analytics tools.

Many of our users work in operations and get external data from partners, customers, or suppliers. This data comes in as Excel or CSV files and is constantly updated. In some cases, new data comes in on a daily or regular basis. The data not only contains new products to be added to a catalog, directory, or database, but it also includes updates to data previously sent, such as pricing updates to existing products. Because this data is messy and comes in with inconsistent quality, it has to be cleaned before it’s any useful. On top of all of this, there are multiple spreadsheets to cross-check and compare because data comes from different systems that don’t talk to each other and no integrations or APIs are built up. And even if they did, the fact is that some of that workflow still needs to happen in your spreadsheets.

Aside from the fact that it’s a manual, tedious, and time-consuming process to get this work done, there are other problems:

  • Spreadsheets start getting slow with larger files and more rows
  • To succeed in importing the data into the database, it has to be cleaned and validated. For example, you need ensure that emails, phone numbers, zip codes, product codes, categories are formatted correctly, and amounts and numbers fall within reasonable ranges for sanity check
  • Dates are inconsistently formatted because some of the data was manually entered and your spreadsheet can’t easily do this
  • The files received are constantly changing and/or updating, so you have to go back into your master spreadsheet to figure out if you need to update any of the records in there. Opening that master spreadsheet takes time. It crashes often
  • The use case requires cross-checking, cross referencing, or comparing two or more Excel files, CSV files, or spreadsheets. None of the build-in spreadsheet functions or features allow this in a convenient way
  • Finally, after the data is cleaned and validated, when trying to import it into the database, it will complain that not all the columns match or not all the data types are correct
  • This is on a “good day”. Most of the time there are a lot of other problems and corner cases

What options do you have?

First, you can Google ways to do this within Excel or Google Sheets. There are many solutions out there such as opening two Excel side by side, adding a conditional to check if two sheets are the same, and even adding conditional formatting to point out differences quickly. If you want to stay within Excel this makes a lot of sense. The problem is solvable, but it takes time. However, if you want to automate this and spend way less time than you should, then might want to try Dropbase.

In this tutorial I’ll show you how to not only solve these problems, but also automate most (if not ALL) of this arduous process. Along the way, you quickly see how this workflow can save yourself at least 75% of the manual effort you’re spending today.

To start with, I have 2 large spreadsheets, named “Customer Data - System 1” and “Customer Data - System 2”. They are extracted from System 1 and System 2, respectively. The two files are almost exactly the same. The only difference is that one customer shows as having a "pro" plan in one and "enterprise" in the other.

Customer data system 1 excel
Customer data system 2 google sheets

Let’s get to it!

1. Create tables and upload files

The first step is to import these spreadsheets into Dropbase. Each spreadsheet will be imported to its corresponding database table. In this post, we’ll assume that you want to import them into new tables. Note that you can also continue importing more data to existing tables.

To create database tables in Dropbase, hover over the folder you’d like to create the table in and click on “Create Table”. Type in a name for the table e.g. “Large Spreadsheet 1” and drag and drop Large Spreadsheet 1.

When uploaded, click Next. In the next screen, review that your data is displayed correctly and that we’ve selected the correct row for the column header. If not, you can change it by adding steps. If everything looks good, click “Next”.

In the next screen review and edit all the column names to make them shorter or more readable. Review that all the column types have been inferred correctly and update as required. You can optionally add any validation checks you’d like Dropbase to run for you. When you add a check, we will highlight any values that do not meet the check requirement. For example, if you add an Email check on the email column, we’ll check to ensure no value that isn’t formatted as a correct email can be imported. Click on “Create”

Create table and add validation

2. Clean & Validate

Dropbase will highlight values that either do not meet Check criteria, incorrect data, or data type mismatches. For example, in the screenshot, the emails highlighted violate the Check constraint, the “age” column contains text, and dates are ambiguous. When all the validation errors are fixed and the data is cleaned up, there will be no more error messages and you can now proceed to the next step. Click “Next”.

Clean and validate your data

3. Merge all your data

You can now proceed to load the cleaned and validated data to the database table you created in the first step. In this post we’ll use the Merge option that allows you to insert new records and update existing ones. This is a powerful feature that helps you easily build a master dataset. To do this, select one or more columns to use as the key or unique key combination. Then select the columns you’d like to update when those keys match between the incoming data and the destination table. Select to insert new records on unmatched rows and click on “Load to Database”

Merge data using email as key in Dropbase

4. Compare tables and saving results as Views

Before proceeding to compare tables, you must complete all the previous steps again for “Large Spreadsheet 2”. Once you finalize this, you will have 2 tables with the data imported from the 2 spreadsheets you started with.

From the app’s navbar, go to “Query” to access the query feature. Copy and paste in the following command. If you are using your own data and your own spreadsheets and table names, you can customize all those names in the query and re-run it to suit your specific names.

SQL for comparing tables

Text below for easy copy and paste:

select s1."email", s1."firstname", s1."lastname", s1."plan", s1."date" from "PUBLIC"."Large Spreadsheet 1" as s1except select s2."email", s2."firstname", s2."lastname", s2."plan", s2."date" from "PUBLIC"."Large Spreadsheet 2" as s2union(select s2."email", s2."firstname", s2."lastname", s2."plan", s2."date" from "PUBLIC"."Large Spreadsheet 2" as s2except select s1."email", s1."firstname", s1."lastname", s1."plan", s1."date" from "PUBLIC"."Large Spreadsheet 1" as s1);

When you run this query you will see the row where data differs across both tables. You can now save this query and its corresponding result as a “View”. A “View” is like a saved query that you can easily access from the Database Manager (through the navigation tree on the left). Next to the query results, click on "Save as View", chose a folder, and enter a name e.g. "Spreadsheet 1 and 2 Mismatch". Click "Save"

Mismatched data

5. Automating the entire process

Back in the first step, we started with 2 spreadsheets, but did not talk about where they came from. If those files are manually extracted, you’ll have to re-import them each time they are updated. While Dropbase automates part of the data cleaning by remembering your cleaning steps, and remembering how to load data to the database, there’s no way around having to do this. However, if you happen to receive those files via email attachments, Dropbase could help you automate most of the process through a new feature called “Dropmail”. This feature lets you email attachments or auto-forward attachments to a special email address. When an attachment is received via this address, it automatically extracts the file, cleans as much of it as possible, and if after the clean up there aren’t any other errors that require user input, it will automatically load to the database and update any records that have matching keys. It works like magic! If you set up auto-forward rules for each of the 2 spreadsheets, Dropbase will process each into its corresponding table!

Copy dropmail

Additionally, if you also saved the query in the previous step as a ‘View”, any time you click on it, you will get the most up to date data! If your Dropmail automation is correctly set up, you can count on the Views always having the latest data that results in comparing the 2 spreadsheets. This is because “Views” are real-time queries based on the underlying tables. When the table gets updated, the view is install and automatically updated too!

That’s it!

To summarize, in this article, you’ve started with 2 large spreadsheets that you receive via email from your external partners on a regular basis (weekly). You can import them into Dropbase, create a comparison table on the fly, and fully automate your entire manual workflow. With Dropbase, you go from having to manually download and open each spreadsheet every week, copying and pasting data back and forth, writing VLOOKUPs or other ways to compare tables, and then re-doing the same process week after week to basically just opening Dropbase and clicking on your comparison table.

Automate your workflow today. Automatically import, validate, and compare large spreadsheets with minimal effort, even if the data is constantly changing. Request a demo.