PostgreSQL Data Cleaning vs. Python Data Cleaning

Introduction

During analysis, you rarely—if ever—get to go directly from evaluating data to transforming and analyzing it. Sometimes, to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning—and that’s a lot of cleaning! To accomplish all this work, you may use Excel, R, or Python, but are these the best tools for your PostgreSQL data cleaning tasks?

In this blog post, I explore some classic data cleaning scenarios and show how you can perform them directly within your database using TimescaleDB and PostgreSQL, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL can't replace these tools entirely. However, they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.

Cleaning is a critical part of the analysis process. In my experience, it can be the most grueling! By cleaning data directly within my database, I can perform many of my cleaning tasks one time rather than repetitively within a script. This saves me considerable time in the long run.

🔖

Learn how you can perform time-series analysis in R.

How PostgreSQL Data Cleaning Fits in the Data Analysis Process

I began this series of posts on data analysis by presenting the following summary of the analysis process:

Image showing Evaluate -> Clean -> Transform -> Model, accompanied by icons which relate to each step — Data analysis lifecycle

The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, these being excellent options for analysis.

However, once I was introduced to PostgreSQL and TimescaleDB, I found how efficient and fast it was to do my data munging directly within my database. In my previous post, I focused on showing data evaluation techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, data cleaning. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.

As I mentioned in my last post, my first job out of college was at an energy and sustainability solutions company that focused on monitoring utility usage—such as electricity, water, sewage, you name it—to determine how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.

Throughout this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database.

Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script.

This meant that every time I ran my Python script, I would have to wait for my machine to spend computational time setting up and cleaning my data. This means that I lost time with every run of the script. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script or pass it along to them to run. This extra computational time could add up depending on the project.

Instead, with PostgreSQL data cleaning, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script, I could just set up the cleaning process in my database and call it a day! Once I started to make PostgreSQL data cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and simply focus on jumping straight into modeling my data.

To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our Slack channel, where you can ask the Timescale community specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!

Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for a free 30-day trial of Timescale (if you prefer self-hosting, you can always install and manage TimescaleDB on your own PostgreSQL instances). Learn more by following one of our many tutorials.

Now, before we dip into things and get our data, as Outkast best put it, “So fresh, So clean,” I want to quickly cover the data set I'll be using. In addition, I want to note that all the code I show will assume you have some basic knowledge of SQL. If you aren't familiar with SQL, don’t worry! In my last post, I included a section on SQL basics.

About the Sample Dataset

In my experience within the data science realm, I have done most of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose is dependent on the initial state of your data and how easy it is to evaluate. For the data set I'll use today, I'd likely do some initial cleaning before evaluation and then clean again after. Let me show you why.

I got the following IoT data set from Kaggle, where a very generous individual shared their energy consumption readings from their apartment in San Jose, CA, the data incrementing every 15 minutes. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:

Graphic showing the setup of the table. The tables name is 'energy_usage_staging'. each row contains the tables column and data types, the pairs of info are as follows ([type, text], [date, date], [start_time, time], [end_time, time], [usage, float4], [units,text], [cost, text], [notes, text])

and appears like this…

type	date	start_time	end_time	usage	units	cost
Electric usage	2016-10-22	00:00:00	00:14:00	0.01	kWh	$0.00
Electric usage	2016-10-22	00:15:00	00:29:00	0.01	kWh	$0.00
Electric usage	2016-10-22	00:30:00	00:44:00	0.01	kWh	$0.00
Electric usage	2016-10-22	00:45:00	00:59:00	0.01	kWh	$0.00
Electric usage	2016-10-22	01:00:00	01:14:00	0.01	kWh	$0.00
Electric usage	2016-10-22	01:15:00	01:29:00	0.01	kWh	$0.00
Electric usage	2016-10-22	01:30:00	01:44:00	0.01	kWh	$0.00
Electric usage	2016-10-22	01:45:00	01:59:00	0.01	kWh	$0.00

In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:

The cost is seen as a text data type which will cause some issues.
The time columns are split apart which could cause some problems if I want to create plots over time or perform any type of modeling based on time.
I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household).

In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!

PostgreSQL Data Cleaning

Here's what I've done in the past while working in data science. While these examples aren't exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.

Please feel free to explore these various techniques and skip around if you need! There's a lot here, and I designed it to be a helpful glossary of tools that you could use as needed.

The techniques I will cover include:

Note on cleaning approach:

There are many ways that I could approach PostgreSQL data cleaning. I could create a table then ALTER it as I clean, I could create multiple tables as I add or change data, or I could work with VIEWs. Depending on the size of my data, any of these approaches could make sense. However, they will have different computational consequences.

You may have noticed above that my raw data table was called energy_usage_staging. This is because I decided that given the state of my raw data, it's better to place the raw data in a staging table, clean it using VIEWs, and then insert it into a more usable table as part of my cleaning process.

This move from raw table to the usable table could happen even before the evaluation step of analysis. Sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL VIEWs, something I will talk about later.

Generally, if you're dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I'll show how to build up clean data using VIEWs along with additional tables. This method of cleaning is more efficient. It also sets up our next blog post about data transformation, which includes the use of scripts in PostgreSQL.

Correcting structural issues

Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that we have date and time columns separated and costs is recorded as a text data type. I need to convert my separated date time columns to a timestamp and the cost column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.

TimescaleDB hypertables and why timestamp is important

For those of you not familiar with the structure of TimescaleDB hypertables, they're at the basis of how we efficiently query and manipulate time-series data. Timescale hypertables are partitioned based on time, and more specifically by the time column you specify upon creation of the table.

The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some chunk based on a time range. We then use these time chunks to help query the rows so that you can get more efficient querying and data manipulation based on time. This image represents the difference between a normal table and our special hypertables.

Graphic showing a normal table vs a hypertable. The normal table just shows data in a table. The hypertable shows data in the table, but it also shows the data being "grouped" or "chunked" by day. By adding an index like structure based on time, queries can be more efficient.

Changing date-time structure

Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time-based queries, I want to restructure the energy_usage_staging table's date and time columns. I could use the date column for my hypertable partitioning, however, I would have limited control over manipulating my data based on time. It is more flexible and space-efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!

Looking back at the table structure, I should be able to get a usable timestamp value from the date and start_time columns as the end_time really doesn’t give me that much useful information. Thus, I want to essentially combine these two columns to form a new timestamp column, let’s see how I can do that using SQL. Spoiler alert: it's as simple as an algebraic statement. How cool is that?!

PostgreSQL code:

For my PostgreSQL data cleaning code, I can create the column without inserting it into the database. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables just yet.

Let’s first compare the original columns with our newly generated columns. For this query, I simply add the two columns together. The AS keyword allows me to rename the column to whatever I would like, in this case, time.

--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time 
FROM energy_usage_staging eus;

Results:

date	start_time	time
2016-10-22	00:00:00	2016-10-22 00:00:00.000
2016-10-22	00:15:00	2016-10-22 00:15:00.000
2016-10-22	00:30:00	2016-10-22 00:30:00.000
2016-10-22	00:45:00	2016-10-22 00:45:00.000
2016-10-22	01:00:00	2016-10-22 01:00:00.000
2016-10-22	01:15:00	2016-10-22 01:15:00.000

Python code:

In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I would have to concatenate the two columns along with a defined space, then convert that column to datetime.

energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])

Changing column data types

Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the TO_NUMBER() function.

The format of the function is as follows: TO_NUMBER(‘text’, ‘format’) . The ‘format’ input is a PostgreSQL specific string that you can build depending on what type of text you want to convert. In our case we have a $ symbol followed by a numeric set up 0.00. For the format string I decided to use ‘L99D99’. The L lets PostgreSQL know there is a money symbol at the beginning of the text, the 9s let the system know I have numeric values, and then the D stands for a decimal point.

I decided to cap the conversion on values that would be less than or equal to ‘$99.99’ because the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding in a G for commas. For example, say you have a cost column with text values like this ‘$1,672,278.23’ then you would want to format the string like this ‘L9G999G999D99’

PostgreSQL code:

--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus  
ORDER BY cost_new DESC

Results:

cost	cost_new
$0.65	0.65
$0.65	0.65
$0.65	0.65
$0.57	0.57
$0.46	0.46
$0.46	0.46
$0.46	0.46
$0.46	0.46

Python code:

For Python, I used a lambda function that systematically replaces all the ‘$’ signs with empty strings. This can be fairly inefficient.

energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])

Creating a `VIEW`

Now that I know how to convert my columns, I can combine the two queries and create a VIEW of my new restructured table. A VIEW is a PostgreSQL object that allows you to define a query and call it by its VIEWs name as if it were a table within your database. I can use the following query to generate the data I want and then create a VIEW that I can query as if it were a table.

PostgreSQL code:

-- query the right data that I want
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Results:

type	time	usage	units
Electric usage	2016-10-22 00:00:00.000	0.01	kWh
Electric usage	2016-10-22 00:15:00.000	0.01	kWh
Electric usage	2016-10-22 00:30:00.000	0.01	kWh
Electric usage	2016-10-22 00:45:00.000	0.01	kWh
Electric usage	2016-10-22 01:00:00.000	0.01	kWh
Electric usage	2016-10-22 01:15:00.000	0.01	kWh
Electric usage	2016-10-22 01:30:00.000	0.01	kWh
Electric usage	2016-10-22 01:45:00.000	0.01	kWh
Electric usage	2016-10-22 02:00:00.000	0.02	kWh
Electric usage	2016-10-22 02:15:00.000	0.02	kWh

I decided to call my VIEW energy_view. Now, when I want to do further cleaning, I can just specify its name in the FROM statement.

--create view from the query above
CREATE VIEW energy_view AS
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Python code:

energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']]
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))

It is important to note that with PostgreSQL VIEWs, the data inside of them have to be recalculated every time you query it. This is why we want to insert our VIEW data into a hypertable once we have the data set up just right. You can think of VIEWs as a shorthand version of the CTEs WITH AS statement I discussed in my last post.

We are now one step closer to cleaner data!

Creating or generating relevant data

With some quick investigation, we can see that the notes column is blank for this data set. To check this I just need to include a WHERE clause and specify where notes are not equal to an empty string.

PostgreSQL data cleaning code—detecting blank notes:

SELECT * 
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';

The results come out empty.

Python code:

print(energy_df[energy_df['notes'].notnull()])

Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later on during modeling. One thing I would like to add in particular, is a column that specifies the day of the week. To do this I can use the EXTRACT() command. The EXTRACT() command is a PostgreSQL date/time function that allows you to extract various date/time elements. For our column, PostgreSQL has the specification DOW (day-of-week) which maps 0 to Sunday through to 6 for Saturday.

PostgreSQL code:

--extract day-of-week from date column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew

Results:

type	time	usage	units	day_of_week
Electric usage	2016-10-22 00:00:00.000	0.01	kWh	6
Electric usage	2016-10-22 00:15:00.000	0.01	kWh	6
Electric usage	2016-10-22 00:30:00.000	0.01	kWh	6
Electric usage	2016-10-22 00:45:00.000	0.01	kWh	6
Electric usage	2016-10-22 01:00:00.000	0.01	kWh	6
Electric usage	2016-10-22 01:15:00.000	0.01	kWh	6

Python code:

energy_df['day_of_week'] = energy_df['time'].dt.dayofweek

Additionally, we may want to add another column that specifies if a day occurs over a weekend or weekday. To do this, I create a boolean column, where true represents a weekend, and false represents a weekday. I then apply a CASE statement. With this command, I can specify “when-then” statements (similar to “if-then” statements in coding) where I can say WHEN a day_of_week value is IN the set (0,6) THEN the output should be true, ELSE the value should be false.

PostgreSQL code:

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
	ELSE false
END AS is_weekend
FROM energy_view ew

Results:

type	time	usage	units	day_of_week	is_weekend
Electric usage	2016-10-22 00:00:00.000	0.01	kWh	6	true
Electric usage	2016-10-22 00:15:00.000	0.01	kWh	6	true
Electric usage	2016-10-22 00:30:00.000	0.01	kWh	6	true
Electric usage	2016-10-22 00:45:00.000	0.01	kWh	6	true
Electric usage	2016-10-22 01:00:00.000	0.01	kWh	6	true

Fun fact: you can do the same query without a CASE statement, however it only works for binary columns.

--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew

Python code:

Notice that in Python, the weekends are represented by numbers 5 and 6 vs. the PostgreSQL weekend values 0 and 6.

energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))

And maybe things then start getting really crazy. Maybe you want to add more parameters!

Let’s consider holidays. Now, you may be asking, “Why in the world would we do that?!” but often, people have time off during some of the holidays within the US. Since this individual lives within the US, they likely have at least some of the holidays off. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I create another boolean column identifying when a federal holiday occurs.

I can accomplish this using TimescaleDB’s time_bucket() function. The time_bucket() function is one of the functions I discussed in detail in my previous post. Essentially, I need to use this function to make sure all time values within a single day get accounted for. Without using the time_bucket() function, I would only see changes to the row associated with the 12 a.m. time period.

PostgreSQL code:

After I create a holiday table, I can then use the data from it within my query. I also decided to use the non-case syntax for this query. Note that you can use either!

--create table for the holidays
CREATE TABLE holidays (
date date)

--insert the holidays into table
INSERT INTO holidays 
VALUES ('2016-11-11'), 
('2016-11-24'), 
('2016-12-24'), 
('2016-12-25'), 
('2016-12-26'), 
('2017-01-01'),  
('2017-01-02'), 
('2017-01-16'), 
('2017-02-20'), 
('2017-05-29'), 
('2017-07-04'), 
('2017-09-04'), 
('2017-10-9'), 
('2017-11-10'), 
('2017-11-23'), 
('2017-11-24'), 
('2017-12-24'), 
('2017-12-25'), 
('2018-01-01'), 
('2018-01-15'), 
('2018-02-19'), 
('2018-05-28'), 
('2018-07-4'), 
('2018-09-03'), 
('2018-10-8')

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew

Results:

type	time	usage	units	day_of_week	is_weekend	is_holiday
Electric usage	2016-10-22 00:00:00.000	0.01	kWh	6	true	false
Electric usage	2016-10-22 00:15:00.000	0.01	kWh	6	true	false
Electric usage	2016-10-22 00:30:00.000	0.01	kWh	6	true	false
Electric usage	2016-10-22 00:45:00.000	0.01	kWh	6	true	false
Electric usage	2016-10-22 01:00:00.000	0.01	kWh	6	true	false
Electric usage	2016-10-22 01:15:00.000	0.01	kWh	6	true	false

Python code:

holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01',  '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-9', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-4', '2018-09-03', '2018-10-8']
energy_df['is_holiday'] = np.where(energy_df['day_of_week'].isin(holidays), 1, 0)
print(energy_df.head(20))

At this point, I’m going to save this expanded table into another VIEW so that I can call the data without writing out the query.

PostgreSQL code:

--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (select date from holidays) AS is_holiday
FROM energy_view ew

You may ask, “Why did you create these as boolean columns?!” A very fair question! You see, I may want to use these columns for filtering during analysis, something I commonly do during my own analysis process. In PostgreSQL, you can use boolean columns to filter things super easily. For example, say that I want to use my table query so far and show only the data that occurs over the weekend AND a holiday. I can do this simply by adding a WHERE statement along with the specified columns.

PostgreSQL code:

--if you use binary columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday

Results:

type	time	usage	units	cost	day_of_week	is_weekend	is_holiday
Electric usage	2016-12-24 00:00:00.000	0.34	kWh	0.06	6	true	true
Electric usage	2016-12-24 00:15:00.000	0.34	kWh	0.06	6	true	true
Electric usage	2016-12-24 00:30:00.000	0.34	kWh	0.06	6	true	true
Electric usage	2016-12-24 00:45:00.000	0.34	kWh	0.06	6	true	true
Electric usage	2016-12-24 01:00:00.000	0.34	kWh	0.06	6	true	true
Electric usage	2016-12-24 01:15:00.000	0.34	kWh	0.06	6	true	true

Python code:

print(energy_df[(energy_df['is_weekend']==1) & (energy_df['is_holiday']==1)].head(10))

Adding data to a hypertable

Now I have new columns ready to go, and I know how I would like my table to be structured. I can now create a new hypertable and insert my cleaned data. In my own analysis with this data set, I may have run clean up to this point BEFORE evaluating my data so that I can get a more meaningful evaluation step in analysis. What’s great is that you can use any of these techniques for general cleaning, either that is before or after evaluation.

PostgreSQL:

CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool,
) 

--command to create a hypertable
SELECT create_hypertable('energy_usage', 'time')

INSERT INTO energy_usage 
SELECT *
FROM energy_view_exp

Note that if you had data continually coming in you could create a script within your database that automatically makes these changes when importing your data. That way you can have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis.

I'll discuss this in detail in my next post. Make sure to stay tuned in if you want to know how to create scripts and keep data automatically updated!

Renaming values

Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the popularity of this Python data analysis question on StackOverflow. The question states “How do I change a single index value in a pandas dataframe?”. Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple using PostgreSQL data cleaning.

When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s CASE statement within the SELECT query. Let’s say I don’t like Sunday being represented by a 0 in the day_of_week column, but would prefer it to be a 7. I can do this with the following query.

PostgreSQL code:

SELECT type, time, usage, cost, is_weekend,
-- you can use case to recode column values 
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week 
END
FROM energy_usage

Python code:

Caveat, this code would make Monday = 7 because the python DOW function has Monday set to 0 and Sunday set to 6. But this is how you would update one value within a column. Likely you would not want to do this exact action, I just wanted to show the python equivalent for reference.

energy_df.day_of_week[energy_df['day_of_week']==0] = 7
print(energy_df.head(250))

Now, let’s say that I wanted to use the names of the days of the week instead of showing numeric values. For this example, I want to ditch the CASE statement and create a mapping table. When you need to change various values, it will likely be more efficient to create a mapping table and then join it to this table using the JOIN command.

PostgreSQL:

--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
)

--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday')

--then I can join this table to my cleaning table to remap the days of the week
SElECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week

Results:

type	time	usage	units	cost	day_of_week_name	weekend
Electric usage	2018-07-22 00:45:00.000	0.1	kWh	0.03	Sunday	true
Electric usage	2018-07-22 00:30:00.000	0.1	kWh	0.03	Sunday	true
Electric usage	2018-07-22 00:15:00.000	0.1	kWh	0.03	Sunday	true
Electric usage	2018-07-22 00:00:00.000	0.1	kWh	0.03	Sunday	true
Electric usage	2018-02-11 23:00:00.000	0.04	kWh	0.01	Sunday	true

Python:

In this case, python has similar mapping functions.

energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})
print(energy_df.head(20))

Hopefully, one of these techniques will be useful for you as you approach data renaming!

Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as AS (I couldn’t not use such a ridiculous statement 😂). When you use the SELECT statement, you can rename your columns like so,

PostgreSQL code:

SELECT type AS usage_type,
time as time_stamp,
usage,
units, 
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;

Results:

usage_type	time_stamp	usage	units
Electric usage	2016-10-22 00:00:00.000	0.01	kWh
Electric usage	2016-10-22 00:15:00.000	0.01	kWh
Electric usage	2016-10-22 00:30:00.000	0.01	kWh
Electric usage	2016-10-22 00:45:00.000	0.01	kWh

Python code:

Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster, but also just more elegant in its code.

energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))

Fill in missing data

Another common problem in the PostgreSQL data cleaning process is having missing data. For the dataset we are using, there are no obviously missing data points. However, it's very possible that with evaluation, we could find missing hourly data from a power outage or some other phenomenon.

This is where the gap-filling functions TimescaleDB offers could come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes, you can navigate this problem by filling in missing data with reasonable estimates and TimescaleDB actually has built-in functions to help you do this.

For example, let’s say that you are modeling energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values.

As an example, I created some data. I called this table energy_data and it is missing both time and energy readings for the timestamps between 7:45 a.m. and 11:30 a.m.

time	energy
2021-01-01 07:00:00.000	0
2021-01-01 07:15:00.000	0.1
2021-01-01 07:30:00.000	0.1
2021-01-01 07:45:00.000	0.2
2021-01-01 11:30:00.000	0.04
2021-01-01 11:45:00.000	0.04
2021-01-01 12:00:00.000	0.03
2021-01-01 12:15:00.000	0.02
2021-01-01 12:30:00.000	0.03
2021-01-01 12:45:00.000	0.02
2021-01-01 13:00:00.000	0.03

I can use TimescaleDB’s gapfilling hyperfunctions to fill in these missing values. The interpolate() function is another one of TimescaleDB’s hyperfunctions. It creates data points that follow a linear approximation given the data points before and after the missing range of data. Alternatively, you could use the locf() hyperfunction which carries the last recorded value forward to fill in the gap (note that locf stands for last-one-carried-forward). Both of these functions must be used in conjunction with the time_bucket_gapfill() function.

PostgreSQL code:

SELECT
--here I specified that the data should increment by 15 mins
  time_bucket_gapfill('15 min', time) AS timestamp,
  interpolate(avg(energy)),
  locf(avg(energy))
FROM energy_data
--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement
WHERE energy IS NOT NULL AND time > '2021-01-01 07:00:00.000' AND time < '2021-01-01 13:00:00.000'
GROUP BY timestamp
ORDER BY timestamp;

Results:

timestamp	interpolate	locf
2021-01-01 07:00:00.000	0.1	0.10000000000000000000
2021-01-01 07:30:00.000	0.15	0.15000000000000000000
2021-01-01 08:00:00.000	0.13625	0.15000000000000000000
2021-01-01 08:30:00.000	0.1225	0.15000000000000000000
2021-01-01 09:00:00.000	0.10875	0.15000000000000000000
2021-01-01 09:30:00.000	0.095	0.15000000000000000000
2021-01-01 10:00:00.000	0.08125	0.15000000000000000000
2021-01-01 10:30:00.000	0.0675	0.15000000000000000000
2021-01-01 11:00:00.000	0.05375	0.15000000000000000000
2021-01-01 11:30:00.000	0.04	0.04000000000000000000
2021-01-01 12:00:00.000	0.025	0.02500000000000000000
2021-01-01 12:30:00.000	0.025	0.02500000000000000000

Python code:

energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').fillna(method='ffill').reset_index()
energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)

Bonus:

The following query shows how I could ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a WHERE clause to specify the times I like to ignore (the second query).

SELECT * 
FROM energy_data 
WHERE energy IS NOT NULL

SELECT * 
FROM energy_data
WHERE time <= '2021-01-01 07:45:00.000' OR time >= '2021-01-01 11:30:00.000'

PostgreSQL Data Cleaning Wrap-Up

After reading through these various techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL data cleaning and TimescaleDB data cleaning provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you're looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB.

In my next posts, I'll discuss techniques for transforming data using PostgreSQL and TimescaleDB. I'll then use everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).

If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members.

If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.

Until next time!

Functionality Glossary:

Adding columns together
TO_NUMBER()
VIEW
WHERE
EXTRACT()
CASE
time_bucket()
JOIN
AS
CREATE TABLE
create_hypertable()
INSERT INTO
time_bucket_gapfill()

Ingest and query in milliseconds, even at terabyte scale.

This post was written by

Miranda Auhl

PostgreSQL Data Cleaning vs. Python Data Cleaning

Introduction

How PostgreSQL Data Cleaning Fits in the Data Analysis Process

About the Sample Dataset

PostgreSQL Data Cleaning

Note on cleaning approach:

Correcting structural issues

TimescaleDB hypertables and why timestamp is important

Changing date-time structure

Changing column data types

Creating a `VIEW`

Creating or generating relevant data

Adding data to a hypertable

Renaming values

Fill in missing data

PostgreSQL Data Cleaning Wrap-Up

Related posts

How I Learned to Love PostgreSQL on Kubernetes: Backup/Restore on Timescale

Data Normalization Tips: How to Weave Together Public Datasets to Make Sense of the World

How We Made PostgreSQL as Fast as Pinecone for Vector Data

Products

Solutions

Costumers

Developers

Pricing

Contact Us

Login

Products

Solutions

Developers

PostgreSQL Data Cleaning vs. Python Data Cleaning

Introduction

How PostgreSQL Data Cleaning Fits in the Data Analysis Process

About the Sample Dataset

PostgreSQL Data Cleaning

Note on cleaning approach:

Correcting structural issues

TimescaleDB hypertables and why timestamp is important

Changing date-time structure

Changing column data types

Creating a VIEW

Creating or generating relevant data

Adding data to a hypertable

Renaming values

Fill in missing data

PostgreSQL Data Cleaning Wrap-Up

Related posts

Creating a `VIEW`