Simple Python Scripts for CSV Manipulation

When working with large files in Excel, you've probably experienced your computer or Excel freezing up. This is extremely frustrating, but the good news is that you can use Python to handle a lot of the daily tasks you normally turn to Excel for.

With Python, you can use Pandas, which is "an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language".

You can think of Pandas as an extremely powerful version of Excel, with a lot more features.

Below I've listed several simple snippets that you can run in Jupyter notebooks to manipulate CSVs when Excel won't cut it.

Also, at the end of this post we've listed some resources to help you get Python and Jupyter notebooks installed on your PC.

Merging Files

It's true that Screaming Frog can now connect to Google Analytics, Ahrefs, and Majestic. However, pulling GA data through that connection only matches up when the GA account doesn't include the domain in its page paths. Some GA accounts report full URLs, domain included, and that's when this merge comes in handy.

The code below will merge the page report from Google Analytics with the internal_html.csv export from Screaming Frog:

import pandas

# Load the Google Analytics page report and the Screaming Frog crawl export
a = pandas.read_csv("analytics.csv", encoding="utf-8", header=0)
b = pandas.read_csv("internal_html.csv", encoding="utf-8")

# Join the two reports on their URL columns and save the combined file
c = b.merge(a, left_on='Address', right_on='Page')
c.to_csv("audit.csv", index=False)
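By default, merge keeps only URLs that appear in both files (an inner join). If you'd rather keep every row from the crawl export even when there's no matching Analytics row, you can pass how='left'. Here's a minimal sketch using made-up data; the Address and Page column names mirror the exports above:

```python
import pandas

# Hypothetical stand-ins for the two exports described above
crawl = pandas.DataFrame({
    "Address": ["/home", "/about", "/contact"],
    "Status Code": [200, 200, 301],
})
analytics = pandas.DataFrame({
    "Page": ["/home", "/about"],
    "Sessions": [120, 45],
})

# how='left' keeps every crawled URL; rows with no GA match get NaN
merged = crawl.merge(analytics, left_on="Address", right_on="Page", how="left")
```

With an inner join, /contact would simply disappear from the output; the left join keeps it and flags the missing Analytics data as NaN.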

You can use similar code to merge Majestic, Ahrefs, and Screaming Frog data:

import pandas

# Merge the Majestic and Ahrefs exports on their URL columns
b = pandas.read_csv("majestic.csv")
c = pandas.read_csv("ahrefs.csv")
d = b.merge(c, left_on='URL', right_on='Page URL')
d.to_csv("merge.csv", index=False)

# header=1 skips the report-title row at the top of the Screaming Frog export
a = pandas.read_csv("internal_html.csv", encoding="utf-8", header=1)

# Merge the crawl data with the combined link data and save the full report
merged = a.merge(d, left_on='Address', right_on='URL')
merged.to_csv("fullmerge.csv", index=False)

Filters

This is really useful for managing large all_inlinks.csv reports from Screaming Frog. Sometimes Excel isn't able to load the file completely, or doesn't list all cells when using the filter functions; instead, it shows a warning message.

When exporting the report from Screaming Frog, you can choose to save it as an .xls file. This will let the data load fully, but you'll get a ton of tabs, which makes it hard to analyze all of the data quickly.

With the code below, you can filter for rows where the Destination column contains a specific keyword:

import pandas

df = pandas.read_csv("all_inlinks.csv")

# Keep only rows whose Destination contains the keyword; na=False skips blank cells
df2 = df[df['Destination'].str.contains("keyword", na=False)]
df2.to_csv("filtered.csv", index=False)
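If the file is too big to load into memory all at once, pandas can also read it in chunks and filter each piece as it goes. A sketch of the same keyword filter, using an in-memory sample in place of the real all_inlinks.csv so it runs end to end:

```python
import io
import pandas

# A tiny in-memory stand-in for all_inlinks.csv
sample = io.StringIO(
    "Source,Destination\n"
    "/a,/blog/post\n"
    "/b,/about\n"
    "/c,/blog/other\n"
)

# Read the report in pieces (use the real filename and a much larger
# chunksize, e.g. 100000, in practice), keeping only matching rows
pieces = []
for chunk in pandas.read_csv(sample, chunksize=2):
    pieces.append(chunk[chunk["Destination"].str.contains("blog", na=False)])

filtered = pandas.concat(pieces, ignore_index=True)
```

Because only one chunk is in memory at a time, this works even when the full CSV would crash Excel or exhaust your RAM.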

The next snippet filters for broken links (rows with a 404 status code):

import pandas

df = pandas.read_csv("all_inlinks.csv")

# Keep only links whose destination returned a 404, and save them
df2 = df[df['Status Code'] == 404]
df2.to_csv("broken_links.csv", index=False)
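You can also catch several status codes at once with isin(), which is handy when you want every error response, not just 404s. A sketch with made-up rows standing in for the export:

```python
import pandas

# Hypothetical rows standing in for an all_inlinks.csv export
df = pandas.DataFrame({
    "Destination": ["/a", "/b", "/c", "/d"],
    "Status Code": [200, 404, 301, 410],
})

# isin() matches any of the listed codes in one pass
errors = df[df["Status Code"].isin([404, 410])]
```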

You can play around with these snippets and filter different columns any way you want.

Removing Duplicates

When using Excel to work with URL redirects, you'll sometimes have URLs that differ only in capitalization. Excel's conditional formatting highlights both cells as duplicates.

The same applies if you use the Remove Duplicates function: Excel will delete either the capital or lower-case URL.

This causes a problem when you don't want to treat upper- and lower-case URLs as duplicates and want to keep both in your analysis. If you have thousands of URLs, a manual check will take hours. This is where Python comes in handy: it treats these as distinct values while still removing every true duplicate in your list.

In the code below, we remove all of the duplicates in the URL column:

import pandas

df = pandas.read_csv("your.csv")

# Drop rows with a repeated URL, keeping the first occurrence of each
df2 = df.drop_duplicates(subset=['URL'], keep='first')
df2.to_csv("unique.csv", index=False)
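If you ever do want the opposite behavior, treating /Page and /page as the same URL, you can deduplicate on a lower-cased copy of the column instead. A sketch with made-up URLs:

```python
import pandas

df = pandas.DataFrame({"URL": ["/Page", "/page", "/about"]})

# Compare URLs case-insensitively, keeping the first row of each group
df2 = df[~df["URL"].str.lower().duplicated(keep="first")]
```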

Extracting Tables from HTML

This code is handy when you want to extract tables from HTML pages. Pandas can read the table elements straight off an HTML page. See the example below:

import pandas

# read_html returns a list of DataFrames, one per table on the page
tables = pandas.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df = tables[0]

You might need to install html5lib, lxml, and BeautifulSoup4 for this code to run.

In your Jupyter notebook, the extracted table renders as a DataFrame.

And of course, you can save this data to a CSV file:

df.to_csv("banklist.csv", index=False)

Removing Sections

There are some cases where you want to remove rows of data from a spreadsheet. Let's say you would like to remove all of the blog URLs from your analysis. The code below accomplishes this:

import pandas

df = pandas.read_csv("your.csv")

# Flag URLs that contain "blog", then keep everything else
blog = df['URL'].str.contains('blog')
df2 = df[~blog]
df2.to_csv("no_blog.csv", index=False)
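The same pattern extends to several sections at once, since str.contains accepts a regular expression. A sketch with made-up URLs, excluding both blog and category pages regardless of casing:

```python
import pandas

df = pandas.DataFrame({"URL": [
    "/blog/post-1", "/category/shoes", "/about", "/blog/post-2", "/contact",
]})

# One regex pattern covers several sections; case=False ignores capitalization
unwanted = df["URL"].str.contains("blog|category", case=False, regex=True)
df2 = df[~unwanted]
```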

Resources

Below I've included some resources to help you install Anaconda, which includes both Python and Jupyter notebooks. That's all you need to run the code in this post.

I hope this makes it easier for you to work with large files. If you have any other tips on how you use Pandas, please share them in the comments below.