Whether this is your first time jumping into data analytics or you're a seasoned veteran, I find there's always something new to learn when it comes to preparing your data for analysis. There are so many different functions that you can perform when cleaning and transforming your data for analysis, it can be hard to know where to start.
The list below covers 10 of the most important functions to understand, whether you're preparing your data in Excel, manipulating it through a programming language like Python or R, or using software with prebuilt functions like Dropbase.
The find and replace function allows you to search through your dataset for a specific string or value, and then replace it with another value of your choosing. It's a function I find myself coming back to again and again. The ability to clean up your datasets by finding the errors and replacing them with the correct terms will save you countless hours, not to mention make your data more useful for analysis right away.
Example of how to use: Find the name of your old payment plans, and replace them with the new terms, so that you can create more accurate analyses.
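In pandas, this kind of find and replace is one line. Here's a minimal sketch of the payment-plan example above; the plan names and the `plan` column are hypothetical:

```python
import pandas as pd

# Hypothetical subscriptions table still using outdated plan names
df = pd.DataFrame({"plan": ["Basic", "Premium", "Basic", "Gold"]})

# Find each old plan name and replace it with the new term
df["plan"] = df["plan"].replace({"Basic": "Starter", "Gold": "Enterprise"})
print(df["plan"].tolist())  # ['Starter', 'Premium', 'Starter', 'Enterprise']
```

The same dictionary-style mapping works across a whole DataFrame, so one call can standardize terms in every column at once.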
Conditional statements, like if statements, allow you to control the flow of information and how manipulations are applied to your dataset. Oftentimes you only want to perform a function if a certain criterion is met, and conditional statements make doing so easy. You can transform your data with increased precision and ensure that it's well prepped for analysis.
Example of how to use: If you want to add different amounts to the bills of customers depending on if they're based in Canada or the US, you can use a conditional statement to separate the two calculations based on where the client is located.
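Here's one way to sketch that Canada/US billing example with pandas and NumPy's `where`. The column names and surcharge amounts are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical customer bills with a country code per customer
df = pd.DataFrame({"country": ["CA", "US", "CA"],
                   "bill": [100.0, 100.0, 50.0]})

# Conditionally add $5 for Canadian customers, $3 for US customers
df["bill"] = np.where(df["country"] == "CA", df["bill"] + 5, df["bill"] + 3)
print(df["bill"].tolist())  # [105.0, 103.0, 55.0]
```

`np.where` applies the condition row by row, which is both faster and cleaner than looping over the dataset with a plain `if`.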
Concatenation allows you to combine multiple columns into one, while splitting allows you to turn one column into multiple. Depending on the purpose, both of these functions can be incredibly useful for improving the readability and usability of your dataset.
Example of how to use: If you have separate day, month and year columns and want to combine them into a single, more usable date column, a concatenation function will do the job. For a split, you could split an email address on the "@" symbol and keep everything after it to help identify which companies are in your dataset.
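Both examples above can be sketched in pandas; the column names here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2023, 2024], "month": [1, 12], "day": [5, 31],
    "email": ["ana@acme.com", "bo@globex.com"],
})

# Concatenate year/month/day into one date string (zero-padding month and day)
df["date"] = (df["year"].astype(str) + "-"
              + df["month"].astype(str).str.zfill(2) + "-"
              + df["day"].astype(str).str.zfill(2))

# Split each email on "@" and keep the domain to identify the company
df["company"] = df["email"].str.split("@").str[1]
print(df[["date", "company"]].values.tolist())
```

From there, the concatenated string can be cast to a real datetime column (see the type-changing function below in this list).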
Trimming is another function that is both simple to use and useful for analytics. Trimming the values in a column keeps only the specific portion of the data you wish to use for your analysis. By keeping a set number of characters from the start, end or middle of the cell, you can clean up and standardize your data.
Example of how to use: If you have identifier codes like "TOR-32245" or "NYC-32245" that represent the same product sold at two different stores, and you only want to keep the product numbers, you could trim off the first four characters. This would allow you to compare sales of different products, irrespective of the store where they were purchased.
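In pandas this kind of trim is just string slicing on the column. A minimal sketch using the store-code example (the `sku` column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"sku": ["TOR-32245", "NYC-32245", "TOR-99871"]})

# Trim off the first four characters (store prefix plus dash),
# keeping only the product number
df["product_id"] = df["sku"].str[4:]
print(df["product_id"].tolist())  # ['32245', '32245', '99871']
```

The `.str[start:stop]` slice also handles the "middle of the cell" case, e.g. `df["sku"].str[0:3]` would keep just the store prefix instead.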
Changing column types is another way to improve the usability of your data. By casting columns into their correct types, you'll be able to use functions and filters specifically designed for those column types, such as math operations with integer/decimal columns, string manipulation with text columns, or filtering or sorting by date using datetime columns.
Example of how to use: If you want to do further analysis on data over time, converting a column containing the date to be in a standardized DateTime format will improve the usability of your data.
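Here's a short sketch of casting columns in pandas; assume a table where both the date and the amount arrived as text:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2024-01-05", "2024-02-10"],
                   "amount": ["19.99", "5.00"]})

# Cast each column to its correct type
df["order_date"] = pd.to_datetime(df["order_date"])
df["amount"] = df["amount"].astype(float)

# Date-aware filtering and math now work as expected
recent = df[df["order_date"] > "2024-01-31"]
total = df["amount"].sum()
```

Before the cast, `"5.00" > "19.99"` would compare alphabetically and summing would concatenate strings; after it, comparisons and arithmetic behave the way the analysis needs.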
The unique function removes duplicate entries from your dataset, allowing you to prepare the data better for analytics purposes. The unique function can also be used to generate aggregate statistics, such as the number of unique users that a dataset tracks.
Example of how to use: Run a unique function on a column that should contain only unique values, such as a transaction ID field, and remove any duplicate rows so they don't inflate your transaction numbers.
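In pandas, `nunique` gives the aggregate count and `drop_duplicates` does the removal. A sketch with a hypothetical transactions table:

```python
import pandas as pd

# Transaction 102 was accidentally recorded twice
df = pd.DataFrame({"transaction_id": [101, 102, 102, 103],
                   "amount": [9.50, 20.00, 20.00, 7.25]})

# Aggregate statistic: how many distinct transactions are there?
n_unique = df["transaction_id"].nunique()  # 3

# Remove duplicate rows so totals aren't inflated
deduped = df.drop_duplicates(subset="transaction_id")
```

Note that `drop_duplicates` keeps the first occurrence by default; pass `keep="last"` if the most recent record is the trustworthy one.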
Sorting your data allows you to quickly identify the data that matters the most to you. Especially if you're looking to analyze specific subsections of the data, sorting can help you to identify the rows that will be most (or least) beneficial for you to work with.
Examples of how to use: Sorting a list of prospects by their projected budget so you can focus first on analyzing the jobs that bring in the most revenue.
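Sorting that prospect list is a one-liner in pandas; the column names here are made up:

```python
import pandas as pd

df = pd.DataFrame({"prospect": ["A", "B", "C"],
                   "budget": [50_000, 120_000, 80_000]})

# Sort by projected budget, largest first
df = df.sort_values("budget", ascending=False)
print(df["prospect"].tolist())  # ['B', 'C', 'A']
```

`sort_values` also accepts a list of columns, so you could sort by region first and budget second to work through prospects territory by territory.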
Filtering your data lets you create subgroups of data that match (or don't match) certain parameters. This is useful for data analytics where your dataset needs to be split into subgroups for analysis based on specific characteristics. These filters will let you complete more specific, and more accurate calculations based on only the rows matching your conditions.
Examples of how to use: Filter users who have spent over $50 and who first signed up from referral links to identify how successful your customer referral campaign is.
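The referral-campaign filter above might look like this in pandas, with hypothetical column names and a made-up spend threshold matching the example:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u2", "u3"],
    "spend": [75.0, 40.0, 60.0],
    "signup_source": ["referral", "referral", "ads"],
})

# Keep only users who spent over $50 AND signed up via a referral link
referred = df[(df["spend"] > 50) & (df["signup_source"] == "referral")]
print(referred["user"].tolist())  # ['u1']
```

Each condition produces a boolean mask, and `&` combines them row by row; flipping to `|` (or) or `~` (not) gives you the "don't match" subgroups just as easily.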
When analyzing data, outliers can throw off your predictions and make your data unusable. By identifying data points that land more than a couple of standard deviations away from the mean, you can vastly improve the quality of your data.
Examples of how to use: You have a dataset that includes home prices in your city, but one realtor reported their data in thousands (637 instead of 637,000). By removing these data points, you can perform more accurate analysis and predictions with your data.
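One common way to apply the standard-deviation rule from above is a z-score filter. This sketch uses made-up home prices, including one mistakenly reported in thousands:

```python
import pandas as pd

# Hypothetical home prices; 637 was reported in thousands by mistake
prices = pd.Series([637, 620_000, 655_000, 637_000, 648_000,
                    612_000, 630_000, 660_000, 645_000, 625_000])

# Flag points more than two standard deviations from the mean as outliers
z = (prices - prices.mean()) / prices.std()
cleaned = prices[z.abs() <= 2]
```

One caveat: a single extreme value inflates the standard deviation itself, so on very small samples this cutoff can fail to flag anything; median-based methods are more robust in that case.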
Oftentimes datasets are missing information, which makes analysis a pain. Null values can skew your predictions and calculations, so removing them entirely or replacing them with a more reasonable value (such as the mean) is incredibly useful.
Example of how to use: Your dataset of sports scores has a null value every time a team doesn't score. Instead, you fill the null values with the correct score of zero, making it easier to perform analysis on.
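Filling those nulls with zero is a single `fillna` call in pandas; the team and score values here are invented:

```python
import pandas as pd

# A missing score just means the team didn't score
df = pd.DataFrame({"team": ["A", "B", "C"],
                   "score": [3.0, None, 1.0]})

# Replace nulls with the correct value of zero
df["score"] = df["score"].fillna(0)
print(df["score"].tolist())  # [3.0, 0.0, 1.0]
```

When zero isn't the right substitute, `df["score"].fillna(df["score"].mean())` fills with the mean instead, and `df.dropna()` removes the incomplete rows entirely.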