Trying to figure out how to break up a text file
Clash Royale CLAN TAG#URR8PPP
Trying to figure out how to break up a text file
I have a lengthy series of text files from a project I did about 20 years ago (had to import them from floppy disks!). The original software used FORTRAN and could read the files directly, but I would like to do more efficient manipulations in R. When I read the file into R, you get something along the lines of what you would get if you created the following dataframe:
dataset <-
as.data.frame(c("R4 8561 200 365801HARLAN 16161616116616166116",
"R5 8533 100 472801WHITE 11611111111111111111",
"R4 8573 100 485101MCKENNA 11611161161111611161",
"R6 8513 200 489801HOLMES 66116111611161111161",
"R4 8522 200 492201DAY 11111611111111116111",
"R6 8548 100 500901LURTON 11116111911161111111",
"R5 8547 100 507322HUGHES 16611111111161116611",
"R4 85 3 100 518001VANDEVANTER99999911111111111111",
"R5 8553 100 521301LAMAR 99999911111111111111",
1910))
This should start out as a 10 x 1 dataframe. I am pulling my hair out trying to do the following:
(1) drop the last row of the dataset, regardless of how long the dataset is. When I do something like dataset <- dataset[-nrow(dataset),] it turns the frame into a factor for some reason; then
(2) drop everything in each cell before the names. The names always begin 21 characters in;
(3) Once I have that, I would like to separate the names (which are always 11 characters long, including spaces if needed) from the numbers (which represent a series of votes);
(4) Once I have that, slice the numbers into individual cells (which will always be 1, 6, or 9). The length of the number will vary from file-to-file.
Any help is greatly appreciated.
data.frame
as.data.frame
2) Your
dataset
is already a factor, [-nrow, ]
isn't changing it. Instead first convert to as.character
– Mako212
Aug 12 at 23:07
dataset
[-nrow, ]
as.character
If I may ask: why does R take the dataframe and convert it to a string if I try and remove the last row before converting to characters? Thank you again.
– St4096
Aug 12 at 23:27
R also has a
read.fortran
function - see here for an old example question if you know the FORTRAN input parameters - stackoverflow.com/questions/28490171/… Other than that, this should be approachable as a read.fwf
call if you specify all the widths of the columns - stackoverflow.com/a/41819624/496803– thelatemail
Aug 13 at 0:12
read.fortran
read.fwf
3 Answers
3
I would strongly advise to use read.fwf
, which allows you to read in files with a fixed column width format.
read.fwf
Consider your data consisting of the following lines
lines
lines <-
"R4 8561 200 365801HARLAN 16161616116616166116s
R5 8533 100 472801WHITE 11611111111111111111s
R4 8573 100 485101MCKENNA 11611161161111611161s
R6 8513 200 489801HOLMES 66116111611161111161s
R4 8522 200 492201DAY 11111611111111116111s
R6 8548 100 500901LURTON 11116111911161111111s
R5 8547 100 507322HUGHES 16611111111161116611s
R4 85 3 100 518001VANDEVANTER99999911111111111111s
R5 8553 100 521301LAMAR 99999911111111111111s
1910"
Then we read in the data with a fixed column width format as specified in function argument width
(the first column has width 20, the second width 11, and so on).
width
df <- read.fwf(textConnection(lines), width = c(20, 11, 1, 6, 9))
# Remove first column and last line
df[-nrow(df), -1]
# V2 V3 V4 V5
#1 HARLAN 1 616161 611661616
#2 WHITE 1 161111 111111111
#3 MCKENNA 1 161116 116111161
#4 HOLMES 6 611611 161116111
#5 DAY 1 111161 111111111
#6 LURTON 1 111611 191116111
#7 HUGHES 1 661111 111116111
#8 VANDEVANTER 9 999991 111111111
#9 LAMAR 9 999991 111111111
Some further comments:
read.fwf
file
read.fwf(filename, width = ...)
trimws
dataset <- data.frame(
test = c("R4 8561 200 365801HARLAN 16161616116616166116",
"R5 8533 100 472801WHITE 11611111111111111111",
"R4 8573 100 485101MCKENNA 11611161161111611161",
"R6 8513 200 489801HOLMES 66116111611161111161",
"R4 8522 200 492201DAY 11111611111111116111",
"R6 8548 100 500901LURTON 11116111911161111111",
"R5 8547 100 507322HUGHES 16611111111161116611",
"R4 85 3 100 518001VANDEVANTER99999911111111111111",
"R5 8553 100 521301LAMAR 99999911111111111111",
1910))
dataset <- dataset[-nrow(dataset), ]
Convert to character:
dataset$test <- as.character(dataset$test)
Strip first 20 characters:
dataset$new <- substr(dataset$test,21,100000)
Get names:
dataset$names <- gsub("^([A-Z]+).*", "\1", dataset$new)
Get numbers:
dataset$numbers <- gsub(".*?(\d+)$", "\1", dataset$new)
There should be an answered question on how to split the numbers into columns, see here
dataset
test new names numbers
1 R4 8561 200 365801HARLAN 16161616116616166116 HARLAN 16161616116616166116 HARLAN 16161616116616166116
2 R5 8533 100 472801WHITE 11611111111111111111 WHITE 11611111111111111111 WHITE 11611111111111111111
3 R4 8573 100 485101MCKENNA 11611161161111611161 MCKENNA 11611161161111611161 MCKENNA 11611161161111611161
4 R6 8513 200 489801HOLMES 66116111611161111161 HOLMES 66116111611161111161 HOLMES 66116111611161111161
5 R4 8522 200 492201DAY 11111611111111116111 DAY 11111611111111116111 DAY 11111611111111116111
6 R6 8548 100 500901LURTON 11116111911161111111 LURTON 11116111911161111111 LURTON 11116111911161111111
7 R5 8547 100 507322HUGHES 16611111111161116611 HUGHES 16611111111161116611 HUGHES 16611111111161116611
8 R4 85 3 100 518001VANDEVANTER99999911111111111111 VANDEVANTER99999911111111111111 VANDEVANTER 99999911111111111111
9 R5 8553 100 521301LAMAR 99999911111111111111 LAMAR 99999911111111111111 LAMAR 99999911111111111111
Here is one of the approach : use of text functions: nchar - number of characters in string, substr- to split string with start and end character, and strsplit - to split figure into separate columns :
dataset <- as.data.frame(dataset[-nrow(dataset), ], stringsAsFactors=F)
colnames(dataset) <- "text"
drop20 <- function(x)substr(x, 21, nchar(x))
dataset <- as.data.frame(sapply(dataset, drop20), stringsAsFactors=F)
colnames(dataset) <- "text"
cellnamesplit <- function(x)substr(x, 1, 11)
cellvaluesplit <- function(x)substr(x, 12, nchar(x))
cellname <- as.data.frame(sapply(dataset, cellnamesplit), stringsAsFactors=F)
cellvalue <- as.data.frame(sapply(dataset, cellvaluesplit), stringsAsFactors=F)
splitvalues <- function(x)strsplit(x, "")
valcellall <- t(as.data.frame(apply(cellvalue, 1, splitvalues), stringsAsFactors=F))
final <- cbind(cellname, valcellall)
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
1) Just use
data.frame
notas.data.frame
for your sample data.– Mako212
Aug 12 at 23:05