Twarc Utilities for Windows

This guide is optimized for Windows users (tutorial for MacOS users here). Windows 10 is required.

To use the following utilities, make sure the JSONL file of collected tweet data you want to use is in the same folder as the utils folder (but not inside this utils folder).

Important Note: At the time of writing this, Windows does not play well with certain Unicode UTF-8 characters, including emojis, that are commonly found in tweet datasets. Without making any changes to your computer’s settings, you will likely run into an error while using many of the twarc utilities that looks like this:

Screenshot of unicode encode error

Fix this: You can change your Region settings for Unicode UTF-8 character encoding. Follow the instructions, here. You can change these settings back later if you’d prefer.

For the following twarc lessons, all blue text can be modified to fit your purpose. These are either file names or search/filter terms.

Use the Table of Contents below to quickly navigate this guide.

Table of contents

  1. Tweet Count
  2. Convert JSON to CSV
  3. Users
  4. Hashtags
  5. Emojis
  6. Wordcloud
  7. Tweet Network Graph
  8. Create Geojson
  9. Tweet Wall
  10. Media URLs
  11. Remove Retweets from JSONL
  12. Sort Tweets Chronologically
  13. Remove Tweets Before Certain Date
  14. Remove Duplicate Tweets

Tweet Count

To get the total number of tweets in your dataset, enter the command below:

find /v /c “” tweets.jsonl

Convert JSON to CSV

Use the following command to create a csv file (spreadsheet):

python utils/json2csv.py tweets.jsonl > tweets.csv

Output: Creates a .csv file with the following tweet attributes: Id, tweet_url, created_at, parsed_created_at, user_screen_name, text, tweet_type, coordinates, hashtags, media, urls, favorite_count, in_reply_to_screen_name, in_reply_to_status_id, in_reply_to_user_id, lang, place, possibly_sensitive, retweet_count, retweet_or_quote_id, retweet_or_quote_screen_name, retweet_or_quote_user_id, source, user_id, user_created_at, user_default_profile_image, user_description, user_favourites_count, user_followers_count, user_friends_count, user_listed_count, user_location, user_name, user_statuses_count, user_time_zone, user_urls, user_verified

See the Tweet Data Dictionary for a complete description of these attributes.

Users

To count the number of unique users in the dataset, enter the following command:

python utils/users.py tweets.jsonl > users.txt

This will create a text file listing all users in the dataset.

Hashtags

To to get a list of all the unique hashtags used in the dataset, enter the following command:

python utils/tags.py tweets.jsonl > tweets_hashtags.txt

Emojis

To see a list of all emojis in the dataset, use the following command:

python utils/emojis.py tweets.jsonl > tweets.txt

Output: Creates a text file with all emojis and number count:

Screenshot of emoji txt file

Most Windows programs use black and white emoji characters (Notepad was used to open the text file above). You have a few options:

  1. Change the font to Segoe UI Emoji. In Notepad, click the option Format > Font to change font to Segoe UI Emoji: Screenshot of emoji txt file

    They will probably still be in black and white, but the emojis are more current and legible: Screenshot of emoji txt file

  2. To view in color, copy/paste the contents of your text file into a browser based text editor. Some options include:

Wordcloud

To create a wordcloud, use the following command (see note below):

python utils/wordcloud.py tweets.jsonl > wordcloud.html

Output: Creates an html file, open in your browser to see wordcloud.

Screenshot of wordcloud

Note on wordclouds: Consider removing retweets from your dataset when creating your wordcloud. With retweets included, the text from the original tweet is shortened in the dataset, and can create odd/deceptive wordclouds.

Example:

From original dataset: After retweets removed:
Screenshot of wordcloud Screenshot of wordcloud

That large red “arm…” in the original dataset was from a highly retweeted tweet, where the word “army” was cut off mid-word in the retweet data.

Tweet Network Graph

Create a network graph from your tweet data, use the following command (note there is no ‘>’ in this command between the jsonl file name and the html file name):

python utils/d3_network.py tweets.jsonl tweet_graph.html

Output: Creates an html file, open in your browser to view the network graph of your tweets. This script only shows a graph of original tweets that have been quoted and/or have replies (not the entire dataset). Hover over circles to show connections, click to show tweet (if you click and no tweet appears, it was likely deleted or set private)

Screenshot of tweet network

To include retweets in your network graph, use the following command (if you have a very large dataset, this will take a long time to load, and may be too crowded to read well):

python utils/d3_network.py --retweets tweets.jsonl tweet_graph.html

Output: an html file of your network graph, including retweets. Open in browser:

Screenshot of tweet network w retweets

Create Geojson

Use the following command:

python utils/geojson.py tweets.jsonl > tweets.geojson

Output: Creates a .geojson file. Quickly create map from geojson file:

Tweet Wall

Create a wall of tweets, use the following command:

python utils/wall.py tweets.jsonl > tweets.html

Output: Creates an html file, open in your browser to view all tweets, easy to browse through.

Screenshot of tweet wall

Media URLs

Creates a text file the URLs of images uploaded to Twitter in a tweet json. Use the following command:

python utils/media_urls.py tweets.jsonl > tweets_media.txt

Output: Creates a text document with the tweet ids and media urls.

For just a list of embedded media URLs (without the associated tweet id):

python utils/embeds.py tweets.jsonl > tweets_embeds.txt

Remove Retweets from JSONL

Creates a new JSONL file with retweets removed. Use the following command:

python utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl

Sort Tweets Chronologically

Tweet IDs are generated chronologically. This command will sort your tweet data by Tweet ID in ascending order, which is the same as sorting by date:

python utils/sort_by_id.py tweets.jsonl > tweets_sorted.jsonl

Another option is to create a csv, open in Excel and sort the “id” column by ascending and descending order to identify the first and last tweets in your dataset.

Remove Tweets Before Certain Date

You may want to study tweets from only a select number of days in your dataset. For example, you’ve searched a hashtag related to an event that occured within the last two days, but your search query retrieves tweets from the previous seven days.

Use this command to remove tweets occurring before a given date:

python utils\filter_date.py --mindate 13-oct-2019 tweets.jsonl > filtered.jsonl

Remove Duplicate Tweets

Creates a new JSONL file with no duplicate tweets. This script removes any tweets with duplicate IDs. Use the following command:

python utils/deduplicate.py tweets.jsonl > tweets_deduped.jsonl