Trying to figure out how to break up a text file

I have a lengthy series of text files from a project I did about 20 years ago (had to import them from floppy disks!). The original software used FORTRAN and could read the files directly, but I would like to do more efficient manipulations in R. When I read the file into R, you get something along the lines of what you would get if you created the following dataframe:

dataset <- as.data.frame(c("R4 8561 200 365801HARLAN 16161616116616166116", "R5 8533 100 472801WHITE 11611111111111111111", "R4 8573 100 485101MCKENNA 11611161161111611161", "R6 8513 200 489801HOLMES 66116111611161111161", "R4 8522 200 492201DAY 11111611111111116111", "R6 8548 100 500901LURTON 11116111911161111111", "R5 8547 100 507322HUGHES 16611111111161116611", "R4 85 3 100 518001VANDEVANTER99999911111111111111", "R5 8553 100 521301LAMAR 99999911111111111111", 1910))

This should start out as a 10 x 1 dataframe. I am pulling my hair out trying to do the following:

(1) drop the last row of the dataset, regardless of how long the dataset is. When I do something like dataset <- dataset[-nrow(dataset),] it turns the frame into a factor for some reason; then

(2) drop everything in each cell before the names. The names always begin 21 characters in;

(3) Once I have that, I would like to separate the names (which are always 11 characters long, including spaces if needed) from the numbers (which represent a series of votes);

(4) Once I have that, slice the numbers into individual cells (which will always be 1, 6, or 9). The length of the number will vary from file-to-file.

Any help is greatly appreciated.

1) Just use data.frame not as.data.frame for your sample data.
– Mako212
Aug 12 at 23:05

data.frame

as.data.frame

2) Your dataset is already a factor, [-nrow, ] isn't changing it. Instead first convert to as.character
– Mako212
Aug 12 at 23:07

dataset

[-nrow, ]

as.character

If I may ask: why does R take the dataframe and convert it to a string if I try and remove the last row before converting to characters? Thank you again.
– St4096
Aug 12 at 23:27

R also has a read.fortran function - see here for an old example question if you know the FORTRAN input parameters - stackoverflow.com/questions/28490171/… Other than that, this should be approachable as a read.fwf call if you specify all the widths of the columns - stackoverflow.com/a/41819624/496803
– thelatemail
Aug 13 at 0:12

read.fortran

read.fwf

3 Answers
3

I would strongly advise to use read.fwf, which allows you to read in files with a fixed column width format.

read.fwf

Consider your data consisting of the following lines

lines

lines <- "R4 8561 200 365801HARLAN 16161616116616166116s R5 8533 100 472801WHITE 11611111111111111111s R4 8573 100 485101MCKENNA 11611161161111611161s R6 8513 200 489801HOLMES 66116111611161111161s R4 8522 200 492201DAY 11111611111111116111s R6 8548 100 500901LURTON 11116111911161111111s R5 8547 100 507322HUGHES 16611111111161116611s R4 85 3 100 518001VANDEVANTER99999911111111111111s R5 8553 100 521301LAMAR 99999911111111111111s 1910"

Then we read in the data with a fixed column width format as specified in function argument width (the first column has width 20, the second width 11, and so on).

width

df <- read.fwf(textConnection(lines), width = c(20, 11, 1, 6, 9)) # Remove first column and last line df[-nrow(df), -1] # V2 V3 V4 V5 #1 HARLAN 1 616161 611661616 #2 WHITE 1 161111 111111111 #3 MCKENNA 1 161116 116111161 #4 HOLMES 6 611611 161116111 #5 DAY 1 111161 111111111 #6 LURTON 1 111611 191116111 #7 HUGHES 1 661111 111116111 #8 VANDEVANTER 9 999991 111111111 #9 LAMAR 9 999991 111111111

Some further comments:

read.fwf

file

read.fwf(filename, width = ...)

trimws

dataset <- data.frame( test = c("R4 8561 200 365801HARLAN 16161616116616166116", "R5 8533 100 472801WHITE 11611111111111111111", "R4 8573 100 485101MCKENNA 11611161161111611161", "R6 8513 200 489801HOLMES 66116111611161111161", "R4 8522 200 492201DAY 11111611111111116111", "R6 8548 100 500901LURTON 11116111911161111111", "R5 8547 100 507322HUGHES 16611111111161116611", "R4 85 3 100 518001VANDEVANTER99999911111111111111", "R5 8553 100 521301LAMAR 99999911111111111111", 1910)) dataset <- dataset[-nrow(dataset), ]

Convert to character:

dataset$test <- as.character(dataset$test)

Strip first 20 characters:

dataset$new <- substr(dataset$test,21,100000)

Get names:

dataset$names <- gsub("^([A-Z]+).*", "\1", dataset$new)

Get numbers:

dataset$numbers <- gsub(".*?(\d+)$", "\1", dataset$new)

There should be an answered question on how to split the numbers into columns, see here

dataset test new names numbers 1 R4 8561 200 365801HARLAN 16161616116616166116 HARLAN 16161616116616166116 HARLAN 16161616116616166116 2 R5 8533 100 472801WHITE 11611111111111111111 WHITE 11611111111111111111 WHITE 11611111111111111111 3 R4 8573 100 485101MCKENNA 11611161161111611161 MCKENNA 11611161161111611161 MCKENNA 11611161161111611161 4 R6 8513 200 489801HOLMES 66116111611161111161 HOLMES 66116111611161111161 HOLMES 66116111611161111161 5 R4 8522 200 492201DAY 11111611111111116111 DAY 11111611111111116111 DAY 11111611111111116111 6 R6 8548 100 500901LURTON 11116111911161111111 LURTON 11116111911161111111 LURTON 11116111911161111111 7 R5 8547 100 507322HUGHES 16611111111161116611 HUGHES 16611111111161116611 HUGHES 16611111111161116611 8 R4 85 3 100 518001VANDEVANTER99999911111111111111 VANDEVANTER99999911111111111111 VANDEVANTER 99999911111111111111 9 R5 8553 100 521301LAMAR 99999911111111111111 LAMAR 99999911111111111111 LAMAR 99999911111111111111

Here is one of the approach : use of text functions: nchar - number of characters in string, substr- to split string with start and end character, and strsplit - to split figure into separate columns :

dataset <- as.data.frame(dataset[-nrow(dataset), ], stringsAsFactors=F) colnames(dataset) <- "text" drop20 <- function(x)substr(x, 21, nchar(x)) dataset <- as.data.frame(sapply(dataset, drop20), stringsAsFactors=F) colnames(dataset) <- "text" cellnamesplit <- function(x)substr(x, 1, 11) cellvaluesplit <- function(x)substr(x, 12, nchar(x)) cellname <- as.data.frame(sapply(dataset, cellnamesplit), stringsAsFactors=F) cellvalue <- as.data.frame(sapply(dataset, cellvaluesplit), stringsAsFactors=F) splitvalues <- function(x)strsplit(x, "") valcellall <- t(as.data.frame(apply(cellvalue, 1, splitvalues), stringsAsFactors=F)) final <- cbind(cellname, valcellall)

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Sfyjdyy

Trying to figure out how to break up a text file

Trying to figure out how to break up a text file

3 Answers
3

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard

Trying to figure out how to break up a text file

Trying to figure out how to break up a text file

3 Answers 3

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard

3 Answers
3